
International Journal of Testing, 12: 21–43, 2012
Copyright © Taylor & Francis Group, LLC
ISSN: 1530-5058 print / 1532-7574 online
DOI: 10.1080/15305058.2011.602920

The Effect of Testing on Student Achievement, 1910–2010

Richard P. Phelps
Asheville, North Carolina

This article summarizes research on the effect of testing on student achievement as found in English-language sources, comprising several hundred studies conducted between 1910 and 2010. Among quantitative studies, mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement. Adding stakes or frequency also strongly and positively affects achievement. Survey studies produce effect sizes above 1.0. Ninety-three percent of qualitative studies analyzed also reported positive effects.

Keywords: effect size, literature review, meta-analysis, research summary, student achievement, tests & testing

HOW CAN TESTING AFFECT ACHIEVEMENT?

Psychologists have been studying the effects of testing on educational achievement (and on memory) for a century. They theorize that testing affects achievement by way of certain mediating factors, such as motivation, feedback, alignment, and the “pure” testing effect.

Test motivation has two forms: intrinsic and extrinsic. While intrinsically motivated, a student may work harder or better in order to perform well on a test simply for their own satisfaction, even if that test has no stakes (i.e., consequences). While extrinsically motivated, a student may work harder or better in order to perform well on a test with stakes such as course completion, certification, graduation, or grade promotion.

Test feedback ranges from simple awareness of test results (that measure progress or performance) to overt remediation. Feedback can be diagnostic.

Correspondence should be sent to Richard P. Phelps, PhD, One North Pack Square, Suite 301, Asheville, NC 28801, USA. E-mail: richardpphelps@yahoo.com


Alignment occurs when the content embedded in a test, or the level of performance demanded by a test, matches the content or intensity of an associated prescribed curriculum. Naturally, the more closely aligned the curriculum is to the test, the more likely students who have mastered the content will perform well on the test.1

The pure testing effect is an increase in achievement that occurs simply because students take a test instead of spending the same amount of time some other way, such as studying. The most prominent aspect of the pure testing effect appears to be the generation effect—students taking a test cannot passively absorb information as they might listening to lectures or reading textbooks; they must “generate” the information themselves, and that process apparently makes a more durable impression on memory (Bertsch, Pesta, Wiscott, & McDaniel, 2007).

METHOD

This article summarizes the research literature on the effect of testing on student achievement from 1910 to 2010, as found in English-language sources.

Searches

Two search methods—keyword searches and citation chains—were employed. A keyword search is standard in scholarly research. First, one identifies relevant document databases and research indexes, and relevant keywords that should identify documents containing relevant content. Then, one lets the computer do the work, finding and compiling a list of potential sources. In some pure research fields, where one could reasonably expect to find the largest proportion of relevant studies in a small selection of scholarly journals, a keyword search might be thorough enough to cover the topic reasonably well. For this study, more than 40 search terms were employed (e.g., test, exam, examination, assessment, accountability, competency, effect, consequences, impact, benefit, cost).

With citation chains, one uses the studies found in a keyword search to find related studies. Each study provides evidence of other relevant studies inside its text, citations, or references. This is sometimes called the “ancestry method.”

To be truly thorough, any literature search should follow citation chains, but the method is particularly important for a topic like this one. First, an interest in testing effects overlaps many research field boundaries, such as education, psychology, public policy, and sociology, and many more subfield boundaries. Second, tests are often developed by private companies and sponsored by governments, neither of which is commonly interested in publishing in scholarly journals. Third, some types of studies simply cannot be found through simple keyword searches. They include proprietary work, audits performed by governmental agencies and program evaluations sponsored by them, most studies conducted during the first two thirds of the twentieth century, and master's theses.


Fourth, a large proportion of studies that measured testing effects were focused primarily on other results, and it is those foci and those results that determined which keywords were chosen for the research databases. When testing effect findings are unplanned or coincidental to a study, a computer search on typical keywords probably will not find them. Fifth, a reliance on keyword search can be inadequate to the task due to variation in keywords across fields; different research disciplines employ different vocabularies. A “net benefit” to an economist, for example, may be called “consequential validity” by psychometricians, “positive effects” by program evaluators, or “positive washback” by education planners.

Searches for this study were limited only by the English language. If one source led to another via citation or reference, that lead was followed. But indexes were used systematically, too. Those indexes included the Education Resources Information Clearinghouse (ERIC); FirstSearch of the Online Computer Library Center (OCLC), which includes PsychLit and hundreds of other social science databases; Dissertation Abstracts; several EBSCO databases; Google; Google Scholar; Yahoo Search; and several individual libraries' catalogues.

In addition to all the aforementioned sources, two lists unique to opinion polls were consulted: the Roper Center Public Opinion Online and Polling the Nations databases.

A complete list of the studies reviewed and their bibliographic references is posted at http://npe.educationnews.org/Review/Resources/**.htm, substituting either “QuantitativeList,” “SurveyList,” or “QualitativeList” for “**.”

Geographic Coverage

No attempt was made to limit the search by geographic region. Naturally, however, limiting the search to English-language documents biases a search in favor of those where the English language is predominant—le monde Anglo-Saxon. Granted, English may be used as the language of scientific communication even in countries where another is spoken. But the search for this study was not restricted to journal articles, and other types of documents, such as government reports and conference papers, are more likely to be written in a home language.

In addition, because the search included any opportunist discoveries made in library catalogues and thesis indexes, and only libraries in the United States were visited, a bias toward United States sources is introduced. Finally, the public opinion poll indexes were limited exclusively to U.S. and Canadian sources.

Overall, 81% of the studies included in the analysis had a primarily U.S. focus, but the geographic coverage varies by methodology type. Whereas 89% of the quantitative studies had a U.S. focus, only 65% of the qualitative studies did. These proportions for the United States might seem disproportionately high, but the United States does in fact host over two-thirds of the English-speaking population of the industrialized world.2



Perhaps in part due to inclusion of the uniquely North American public opinion poll data, 95% of the survey studies had a U.S. focus. Indeed, there were so few survey studies (N = 5) from outside the United States and Canada, they were dropped from the analysis, making the analysis of survey studies exclusively North American.

Over 3,000 documents were found that potentially contained evidence of testing effects on achievement. Less than one third qualified for inclusion in the analysis, however. The other 2,000 or so were carefully reviewed, found not to contain relevant or sufficient evidence, and set aside.3 Titles of several hundred more potentially useful sources fill a do-list of documents waiting to be processed in a potential next stage of this research.

Study Types

Searching uncovered a century's worth of studies measuring the effects of testing-related interventions on student achievement. This broad reach captures quite a variety of studies, including many of the historically most typical—relatively small, focused, randomized experiments with undergraduate psychology classes—and the recently popular—hugely aggregated, state- or nationwide, multivariate data-crunchings of accountability regimes, with hundreds of independent variables contributing explanation, obfuscation, or both.

Quantitative studies. One hundred seventy-seven quantitative research studies were reviewed that include 640 separate measurements of effects. Some studies report multiple relevant measures of effects (e.g., measured at different times, with different subgroups, under different conditions). All told, the 640 measures comprise information for close to 7 million separate individuals. Quantitative studies directly measure effects and employ, for example, regression analysis, structural equation modeling, pre-post comparison, experimental design, or interrupted time series design.

These quantitative studies manifest a variety of study designs, the most numerous being straightforward experiments (or quasi-experiments) comparing means (or mean gains, or proportions) of a treatment and a control group. For example, an experiment might randomly assign students to two different classrooms, administer a baseline test, and then test the two groups differently throughout a course, say, giving one group a mid-term test and the other group no mid-term test. Then a final test could be administered and the pre-post gain scores compared, to determine the effect of giving a mid-term test.4

Typically, the studies compared groups (or the same [or similar] group at different time points) facing different tests or test-related situations. One group might have been tested more frequently than the other, tested with higher stakes (i.e., greater consequences), or provided one of two types of feedback—simply being made aware of their test performance or course progress, or given remediation or another type of corrective action based on their test results.



In other studies, the experimental group was told that a test would be counted as part of their grade, and the control group was told that it would not. In a few experiments, one group was given take-home tests during the course and the other group in-class tests. On the final exam, the in-class-tested students tended to perform better, perhaps because the take-home-tested students limited their review of the material to the topics on the take-home tests, whereas the in-class-tested students prepared themselves to be tested on all topics.

A small minority of studies employed pre-post comparisons of either the same population (e.g., 4th-graders one year and 4th-graders in the same jurisdiction a later year), the same cohort (e.g., 4th-graders followed longitudinally from one year to a later year), or a synthetic cohort (e.g., a sample of 4th-graders one year compared to a sample of 8th-graders in the same jurisdiction four years later).

Historically, effect-of-testing research has accompanied interest in one or another of the specific types of interventions. Mastery testing studies, for example, were frequent from the 1960s through the 1980s. Accountability studies coincided with the popularity (in the United States, at least) of “minimum competency” testing from the 1970s on. Frequency-of-testing studies were most popular in the first half of the twentieth century. Memory retention studies were frequently conducted then, too, but have encountered a resurgence in popularity in recent years.

Table 1 describes this collection of quantitative studies by various attributes, including study design and the source, sponsor, and scale of tests used in the studies.

Survey studies. Surveys are a type of quantitative study that measures perceptions of effects—either through public opinion polls or surveys of groups selected to complete a questionnaire as part of a program evaluation. Some survey studies contain multiple relevant items and/or posed relevant questions to multiple respondent groups (e.g., teachers, parents). Two hundred forty-seven survey studies conducted in the United States and Canada were reviewed, summarizing about 700,000 separate individual responses to questions regarding testing's effect on learning and instruction and preferences for common test types.

All told, 813 individual item-response group combinations (hereafter called “items”) were identified. For example, if a polling firm posed the same question to two different respondent groups (e.g., teachers, parents), each item-response group combination was counted as a separate item. As was inevitable, some individual respondents are represented more than once, but never for the same item.

Figure 1 breaks out survey items by data source—public opinion poll or survey embedded in a program evaluation—and respondent group—education provider or education consumer.


TABLE 1
Source, Sponsor, Study Design, and Scale of Tests in Quantitative Studies

                                   Number of Studies   Percent of Studies
Source of Test
  Researcher or Teacher                   87                  54
  Commercial                              38                  24
  National                                24                  15
  State or District                       11                   7
  Total                                  160                 100

Sponsor of Test
  Local                                   99                  62
  National                                45                  28
  State                                   11                   7
  International                            5                   3
  Total                                  160                 100

Study Design
  Experiment / Quasi-experiment          107                  67
  Multivariate                            26                  16
  Pre-post                                12                   8
  Pre-post (with shadow test)              8                   5
  Experiment: Posttest only                7                   4
  Total                                  160                 100

Scale of Test Administration
  Classroom                              115                  72
  Large-scale                             39                  24
  Mid-scale                                6                   4
  Total                                  160                 100

The education provider group comprises administrators, board members, teachers, counselors, and education professors. The education consumer group comprises the public, parents, students, employers, politicians, and tertiary education faculty.

Figure 1 reveals a fairly even division of sources among the 800+ survey items between public opinion polls (55%) and program evaluation surveys (45%). Likewise, an almost even division of respondent group types between education providers (48%) and education consumers (52%) was found.

Among the subtotals, however, asymmetries emerge. Almost five times as many evaluation survey items were posed to education providers as to education consumers. Conversely, almost three times as many public opinion poll items were posed to education consumers as to education providers.

Table 2 reveals that a majority of the survey items (62%) concern high-stakes tests (i.e., tests with consequences for students, teachers, or schools, such as retention in grade [for a student] or licensure denial [for a teacher]). This stands to reason, as high-stakes tests are more politically controversial and more familiar to the public, so pollsters and program evaluators are more likely to ask about them.


[FIGURE 1. Percentage of survey items, by respondent group (education providers vs. education consumers) and type of survey (public opinion polls vs. program evaluation surveys). Y-axis: percent.]

Note: *Three program evaluation survey items were posed to a combined provider-consumer group, and for the purpose of this figure, those three items have been counted twice.


Table 2 also classifies the survey items by the target of the stakes—the group or groups bearing the consequences of good or poor test performance. For a plurality of items, students bore the consequences (46%). Schools (33%) and teachers (14%) were less often the target of test stakes. In some cases, stakes were applicable to more than one group.

TABLE 2
Number and Percent of Survey Items, By Test Stakes and Their Target Group

                    Number of Survey Items   Percent of Survey Items
Level of Stakes
  High                      507                       62
  Medium                    184                       23
  Unknown                    89                       11
  Low                        33                        4
  Total                     813                      100

Target Group
  Students                  393                       46
  Schools                   281                       33
  Teachers                  116                       14
  No Stakes                  64                        7
  Total                     854                      100



Qualitative studies. Any study for which an effect size could not be computed was designated as qualitative. Typically, qualitative studies declare evidence of effects in categorical terms and incorporate more “hands-on” research methods, such as direct observations, site visits, interviews, or case studies. Here, analysis of 244 qualitative studies focuses exclusively on the reported direction of the testing effect on achievement.

To determine whether a qualitative study indicated a positive, neutral, or negative effect of testing on achievement, a specific qualitative meta-analysis approach called template analysis was employed (e.g., Crabtree & Miller, 1999; King, 1998, 2006; Miles & Huberman, 1994). A hierarchical coding rubric was developed to summarize and organize the studies according to themes predetermined as important by previous research and informed by ongoing analysis of the data set. The template establishes broad themes (e.g., stakes) initially; then successively smaller themes (e.g., high, medium, or low) are generated within each broad theme. The template rubric comprised 6 themes and 27 subthemes, identified in Table 3.

All studies were reviewed and coded by three reviewers: a doctoral student in meta-analysis, an editor, and me. Any disparities in coding were discussed among reviewers in an attempt to reach a consensus. In the fewer than 10 cases in which a clear consensus was not reached, the author made the coding decisions.

Qualitative studies used various research methods, and in various combinations. Indeed, 37 of the 245 studies incorporated more than one method (Table 4).

Table 5 categorizes the qualitative studies according to the level of education and the scale, stakes, and target of the stakes of the tests involved. Most (81%) studies included in this analysis were large-scale, while 17% of the studies were at the classroom level. Two percent of the studies focused on testing for teachers.

Summary. Table 6 lists the overall numbers of studies and effects by type—quantitative, survey, and qualitative—and geographic coverage—U.S. or non-U.S. Several studies fit more than one category (e.g., a program evaluation with both survey and observational case study components).

Calculating Effects

For quantitative and survey studies, effect sizes were calculated and the data summarized with weighted and unweighted means. Qualitative studies that reached a conclusion about the effect of testing on student achievement or on instruction were tallied as positive, negative, or in-between.


TABLE 3
Coding Themes and Subthemes Used in Template Analysis

Themes                         Subthemes
Scale of Test                  • Large-Scale
                               • Classroom
                               • Test for Teachers
                               • Not Specified
Stakes of Test                 • High
                               • Medium
                               • Low
                               • Varies
                               • Not Specified
Target of Stakes               • Student
                               • Teacher
                               • School
Research Methods               • Case Study
                               • Interview/Focus Group
                               • Journal/Work Log
                               • Records/Document Review
                               • Research Review
                               • Experiment/Pre-Post Comparison for which No Effect Size Was Reported
                               • Survey for which No Effect Size Was Reported
Rigor of Qualitative Study     • High
                               • Medium
                               • Low
Direction of Testing Effects   • Positive
                               • Positive Inferred
                               • Mixed
                               • No Change
                               • Negative

TABLE 4
Research Method Used in Qualitative Studies

Themes/Subthemes                        Number of Studies   Percent of Studies
Case Study                                    120                  43
Interview (Includes Focus Group)               75                  27
Records or Document Review                     33                  12
Survey                                         22                   8
Experiment or Pre-Post Comparison              21                   7
Research Review                                 8                   3
Journals or Work Logs                           2                   1
Total                                         281                 101

Note: Percentage sums to greater than 100 due to rounding.


TABLE 5
Level of Education, Scale, Stakes, and Target of Stakes in Qualitative Studies

                           Number of Studies   Percent of Studies
Level of Education
  Two or More Levels              115                 50
  Upper Secondary                  55                 24
  Elementary                       34                 15
  Postsecondary                    18                  8
  Lower Secondary                   7                  3
  Adult Education                   3                  1
  Total                           232                101

Scale of Test
  Large-scale                     196                 81
  Classroom                        42                 17
  Test for Teachers                 5                  2
  Total                           243                100

Stakes of Test
  High                            154                 63
  Low                              51                 21
  Medium                           33                 14
  Varies/Not Specified              6                  3
  Total                           244                101

Target of Stakes
  Student                         150                 62
  School                           80                 33
  Teacher                          10                  4
  Varies                            1                 <1
  Total                           241                100

Note: Percentages may sum to greater than 100 due to rounding.

TABLE 6
Number of Studies and Effects, By Type

Type of Study      Number of   Number of   Population Coverage   Percent of Studies
                   Studies     Effects     (in thousands)        with U.S. Focus
Quantitative          177         640            7,000                  89
Surveys & Polls       247         813              700                  95
Qualitative           245         245          unknown                  65
Total                 669       1,698


Several formulae are employed to calculate effect sizes, each selected to align to its relevant study design. The most frequently used is some variation of the standardized mean difference (Cohen, 1988):

d = (X̄T − X̄C) / sp     (1)

where

X̄T = mean of the treatment group,
X̄C = mean of the control group,
sp = pooled standard deviation, and
d = effect size.
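
As a concrete illustration (not part of the original study), Equation (1) can be computed as in the following minimal Python sketch, which assumes the pooled standard deviation is formed from the two groups' standard deviations and sample sizes; all numbers are hypothetical:

import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    # Standardized mean difference (Equation 1): group mean difference in pooled-SD units
    sp = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / sp

# Hypothetical example: treatment mean 78, control mean 72, SDs 10 and 11, 30 students per group
print(round(cohens_d(78, 72, 10, 11, 30, 30), 2))  # 0.57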

For those studies that compare proportions rather than means, an effect size comparable to the standardized mean difference effect size is calculated with a log-odds ratio and a logit function transformation (Lipsey & Wilson, 2001, pp. 53–58):

ES = [ loge( PT / (1 − PT) ) − loge( PC / (1 − PC) ) ] / 1.83     (2)

PT is the proportion of persons in the treatment group with “successful” outcomes (e.g., passing a test, graduating), and PC is the proportion of persons in the control group with successful outcomes. The logit function transformation is accomplished by dividing the log-odds ratio by 1.83.
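
A comparable illustrative sketch of Equation (2), again in Python with hypothetical numbers:

import math

def log_odds_es(p_t, p_c):
    # Log-odds-ratio effect size (Equation 2); dividing by 1.83 places it on the
    # standardized-mean-difference scale (Lipsey & Wilson, 2001)
    return (math.log(p_t / (1 - p_t)) - math.log(p_c / (1 - p_c))) / 1.83

# Hypothetical example: 80% of the treatment group passes versus 60% of the control group
print(round(log_odds_es(0.80, 0.60), 2))  # 0.54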

For survey studies, the response pattern for each survey is quantitatively summarized in one of two ways:

• as frequencies for each point along a Likert scale, which can then be summarized by standard measures of central tendency and dispersion; or
• as frequencies for each multiple-choice response option, which can then be converted to percentages.

Just over 100 of the more than 800 survey items reported their responses in means and standard deviations on a scale. Effects for scale items are calculated as the difference of the response mean from the scale midpoint, divided by the standard deviation.

Close to 700 other survey items with multiple-choice response formats reported the frequency and percentage for each choice. The analysis collapses these multiple choices into dichotomous choices—yes and no, for and against, favorable and unfavorable, agree and disagree, support and oppose, etc. Responses that fit neither side—neutral, no opinion, don't know, no response—are not counted. An effect size is then calculated with the aforementioned log-odds ratio and a logit function transformation.
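
In code, the two survey-item conversions might look like the Python sketch below. The handling of the dichotomized items is an interpretation (the text does not spell out the pairing): the “for” share is entered as PT and the “against” share as PC in Equation (2). All numbers are hypothetical:

import math

def likert_item_es(mean, scale_midpoint, sd):
    # Scale item: distance of the response mean from the scale midpoint, in SD units
    return (mean - scale_midpoint) / sd

def dichotomous_item_es(p_for, p_against):
    # Collapsed multiple-choice item, run through the log-odds ratio of Equation (2)
    # (interpretation: 'for' share as PT, 'against' share as PC)
    return (math.log(p_for / (1 - p_for)) - math.log(p_against / (1 - p_against))) / 1.83

# Hypothetical examples
print(round(likert_item_es(3.9, 3.0, 1.2), 2))          # 0.75 on a 5-point scale
p_for = 0.83 / (0.83 + 0.12)                            # renormalize after dropping neutral responses
print(round(dichotomous_item_es(p_for, 1 - p_for), 2))  # 2.11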



Weighting. Effect sizes are summarized in two ways: with an unweighted mean, in which case each effect size counts the same regardless of the size of the population under study, and with a weighted mean, in which case effect size measures calculated on larger groups count for more.

For the 10% to 20% of the quantitative studies (depending on how one aggregates them) that incorporated pre-post comparison designs, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = 2N / [ 4(1 − r) + d² ]     (3)

where

N = number in study population,
r = correlation coefficient, and
d = effect size.

For the remainder of the studies with standardized mean difference effect sizes, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = [ 2 × NT × NC × (NT + NC) ] / [ 2(NT + NC)² + NT × NC × d² ]     (4)

where

NT = number in treatment group,
NC = number in control group, and
d = effect size.

For the group of about 100 survey items with Likert-scale responses, with standardized mean effect sizes, standard errors and weights (i.e., inverse variances) are calculated thus:

SE = s / √n,    w = n / s²     (5, 6)

where

s = standard deviation, and
n = number of respondents.
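
The two inverse-variance weights in Equations (3) and (4) translate directly into code; a brief illustrative Python sketch with hypothetical numbers:

def weight_pre_post(n, r, d):
    # Equation (3): weight for a pre-post design, given the pre-post correlation r
    return (2 * n) / (4 * (1 - r) + d**2)

def weight_two_group(n_t, n_c, d):
    # Equation (4): weight for a two-group standardized mean difference
    return (2 * n_t * n_c * (n_t + n_c)) / (2 * (n_t + n_c)**2 + n_t * n_c * d**2)

# Hypothetical examples
print(round(weight_pre_post(50, 0.6, 0.8), 1))   # 44.6
print(round(weight_two_group(30, 30, 0.57), 1))  # 14.4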


For the group of about 700 items with proportional-difference responses, with log-odds-ratio effect sizes, standard errors and weights are calculated thus:

w = 1 / SE²,    SE = √( 1/a + 1/b + 1/c + 1/d )     (7, 8)

where

a = number of respondents in favor,
b = number of respondents not in group a,
c = number of respondents against, and
d = number of respondents not in group c.

Once the weights and standard errors are calculated, effect sizes are then “weighted” by the following process (Lipsey & Wilson, 2001, pp. 130–132):

1. Each effect size is multiplied by its weight, w.
2. The weighted mean effect size is calculated thus:

ES̄ = Σ (wi × ESi) / Σ wi     (9)
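
Assembled, the weighting process reduces to a short computation. The Python sketch below (illustrative only, with hypothetical respondent counts) implements Equations (7) through (9):

import math

def se_log_odds(a, b, c, d):
    # Equation (8): standard error of a log-odds ratio, with a, b, c, d as defined above
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

def weighted_mean_es(effect_sizes, weights):
    # Equation (9): weight each effect size, then divide by the sum of the weights
    return sum(w * es for es, w in zip(effect_sizes, weights)) / sum(weights)

# Hypothetical example: one item's weight (Equation 7), then a weighted mean of three items
w_item = 1 / se_log_odds(830, 170, 120, 880)**2
print(round(w_item, 1))                                            # 60.4
print(round(weighted_mean_es([1.2, 0.5, 0.9], [40, 120, 75]), 2))  # 0.75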

Standards for judging effect sizes. Cohen (1988) proposed that a d < 0.20 represented a small effect and a d > 0.80 represented a large effect. A value for d in between, then, represented a medium effect. In his meta-analysis of meta-analyses, however, Hattie (2009) placed the average d for all meta-analyses of student achievement effect studies near 0.40, with a pseudo-Hawthorne Effect base level around 0.10 (i.e., the level at which any achievement-related intervention not producing a clear positive or negative effect will settle).

RESULTS

Quantitative Studies

For quantitative studies, simple and weighted effect sizes are calculated for several different aggregations (e.g., same treatment group, control group, or study author). Mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement. Adding stakes or testing with greater frequency also strongly and positively affects achievement.



Studies are classified with dozens of moderators, including size of study population, scale of test administration, responsible jurisdiction (e.g., nation, state, classroom), level of education, and primary focus of study (e.g., memory retention, testing frequency, mastery testing, accountability program).

For the full panoply of 640 different effects, the “bare bones” mean d ≈ 0.55. After adjustments for small sample sizes (Hedges, 1981) and measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004, chapter 8), the mean d ≈ 0.71, an apparently stronger effect.5
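
The two adjustments cited are standard ones in the meta-analysis literature. The Python sketch below is illustrative rather than the article's own procedure (which is not printed): the Hedges (1981) small-sample correction factor and the Hunter-Schmidt disattenuation for unreliability in the outcome measure, with hypothetical inputs:

import math

def hedges_correction(d, n_t, n_c):
    # Hedges (1981): multiply d by a small-sample correction factor
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))

def attenuation_correction(d, outcome_reliability):
    # Hunter & Schmidt (2004, ch. 8): disattenuate d for measurement error in the
    # dependent variable by dividing by the square root of its reliability
    return d / math.sqrt(outcome_reliability)

# Hypothetical example: d = 0.57 from 30 + 30 students, outcome reliability 0.81
print(round(hedges_correction(0.57, 30, 30), 2))     # 0.56
print(round(attenuation_correction(0.56, 0.81), 2))  # 0.62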

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group or content, with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group, with differing content or outcome measures (N ≈ 430), produces a mean d ≈ 0.75. Grouping studies by the study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated or received some other type of corrective action based on their test performance produced a mean d ≈ 0.96 (see Table 7).

Note that these treatments, such as testing more frequently or with higher stakes, need not be mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions—all with strong effect sizes—but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment Group . . .                                       Mean Effect Size
is made aware of performance (and control group is not)          0.98
receives targeted instruction (e.g., remediation)                0.96
is tested with higher stakes than control group                  0.87
is tested more frequently than control group                     0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                   Number of Studies   Mean Effect Size
Source of Test
  Researcher or Teacher                   87                0.93
  National                                24                0.87
  Commercial                              38                0.82
  State or District                       11                0.72
  Total                                  160

Sponsor of Test
  International                            5                1.02
  Local                                   99                0.93
  National                                45                0.81
  State                                   11                0.64
  Total                                  160

Study Design
  Pre-post                                12                0.97
  Experiment / Quasi-experiment          107                0.94
  Multivariate                            26                0.80
  Experiment: Posttest only                7                0.60
  Pre-post (with shadow test)              8                0.58
  Total                                  160

Scale of Analysis
  Aggregated                               9                1.60
  Small-scale                            118                0.91
  Large-scale                             33                0.57
  Total                                  160

Scale of Test Administration
  Classroom                              115                0.95
  Mid-scale                                6                0.72
  Large-scale                             39                0.71
  Total                                  160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies, such as experiments, produce an average effect size of 0.91, whereas large-population studies, such as those using national populations, produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do “backwash” studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).



The quantitative studies vary dramatically in population size (from just 16 to 11 million), and population size and effect size tend to correlate negatively. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study population quintile compares to its weighted counterpart of 0.28. For the smallest study population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: “Have students learned more because of the [testing program]?” Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the “improve learning” item group.

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: “The diploma examinations have positively affected the way in which I teach.” Forty-one percent responded favorably, 32% unfavorably, and the remaining were neutral (or unresponsive). This survey item is classified in the “improve instruction” item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including “Favor grade promotion exams,” “Favor graduation exams,” “Favor teacher competency exams,” “Favor teacher recertification exams,” “Favor more testing,” “Hold schools accountable for students' scores,” and “Hold teachers accountable for students' scores.”



A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large, in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another—holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group “hold teachers accountable for student scores,” effect sizes are positive for all item groups. The most popular item groups appear to be “Favor teacher recertification exams,” “Favor teacher competency exams,” “Favor (student) graduation exams,” and “Favor (student) grade promotion exams.”

Effect sizes—weighted or unweighted—for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as “improve learning” and “improve instruction,” also are positive and large, whether weighted or unweighted. The remaining two item groups—“favor more testing” and “hold schools accountable for (student) scores”—garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly—strongly and positively—to items about test effects (tests “improve learning” and “improve instruction”). Likewise, both groups strongly favor high-stakes testing for students (“favor grade promotion exams” and “favor graduation exams”).

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for “hold teachers accountable for students' scores” is a moderately large -0.5.


TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958–2008 (1)

                                           Total                     Education Providers (2)     Education Consumers (3)
Item Type                            ES       ES      No. of     ES       ES      No. of     ES       ES      No. of
                                   (Unwt)    (Wt)     Items    (Unwt)    (Wt)     Items    (Unwt)    (Wt)     Items
Testing Improves Learning            1.3      0.5      242       1.2      0.5      144       1.3      0.6       98
Testing Improves Instruction         1.0      0.7      165       1.0      0.6      125       1.2      1.0       40
Testing Aligns Instruction           0.6      0.0       28
Favor Grade Promotion Exams          1.3      1.1       81       1.2      1.0       26       1.3      1.1       53
Favor Graduation Exams               1.4      1.1      121       1.2      0.8       38       1.4      1.2       81
Favor More Testing                   0.5      0.7       72      -0.2      0.1       19       0.7      0.8       53
Favor Teacher Competency Exams       2.1      0.8       61       1.1      0.3       17       2.5      2.2       44
Favor Teacher Recertification Exams  2.1      1.5       22                                   2.5      2.0       18
Hold Teachers Accountable for Scores -0.1     0.0       31      -0.6     -0.5       18       0.6      0.5       13
Hold Schools Accountable for Scores  0.6      0.5      111       0.2     -0.3       38       0.9      0.7       73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.


TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect   Number of Studies   Percent of Studies   Percent w/o Inferred
Positive                    204                  84                    93
Positive Inferred            24                  10
No Change                     8                   3                     4
Mixed                         5                   2                     2
Negative                      3                   1                     1
Total                       244                 100                   100


Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported “[using] results from test to improve coursework.” If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the “positive inferred” studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large—245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as “teaching to the test,” “narrowing the curriculum,” and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.



Some believers in the superiority of qualitative research have also pressed for the inclusion of “consequential validity” measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The “paucity of research” belief has spread widely among researchers of all types and ideological persuasions.

Such claims should be difficult to believe after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the “top” effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this.


First, larger studies often contain more “noise”—information that may not be clearly relevant to testing or achievement, that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis, but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of “test-based accountability” had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of “test-based accountability.” A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So, can planners incorporate the “small-scale” benefits of testing—such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions—into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Indeed, even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Public opinion polls sometimes represent the best available method for learning the public's wishes in democracies; expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.



The “bottom line” result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive—almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious—“narrowing the curriculum” and “teaching to the test” are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.

2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.

3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This “off focus” research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), “opt-out” tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.

4. The actual effect size was calculated, for example, comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.

5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any “test-retest” reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.


Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 2: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

22 PHELPS

Alignment occurs when the content embedded in a test or the level of per-formance demanded by a test matches the content or intensity of an associatedprescribed curriculum Naturally the more closely aligned the curriculum is to thetest the more likely students who have mastered the content will perform well onthe test1

The pure testing effect is an increase in achievement that occurs simply becausestudents take a test instead of spending the same amount of time some other waysuch as studying The most prominent aspect of the pure testing effect appears to bethe generation effectmdashstudents taking a test cannot passively absorb informationas they might listening to lectures or reading textbooks they must ldquogeneraterdquothe information themselves and that process apparently makes a more durableimpression on memory (Bertsch Pesta Wiscott amp McDaniel 2007)

METHOD

This article summarizes the research literature on the effect of testing on studentachievement from 1910 to 2010 as found in English-language sources

Searches

Two search methods, keyword searches and citation chains, were employed. A keyword search is standard in scholarly research. First, one identifies relevant document databases and research indexes, and relevant keywords that should identify documents containing relevant content. Then one lets the computer do the work, finding and compiling a list of potential sources. In some pure research fields, where one could reasonably expect to find the largest proportion of relevant studies in a small selection of scholarly journals, a keyword search might be thorough enough to cover the topic reasonably well. For this study, more than 40 search terms were employed (e.g., test, exam, examination, assessment, accountability, competency, effect, consequences, impact, benefit, cost).

With citation chains, one uses the studies found in a keyword search to find related studies. Each study provides evidence of other relevant studies inside its text, citations, or references. This is sometimes called the "ancestry method."

To be truly thorough, any literature search should follow citation chains, but the method is particularly important for a topic like this one. First, an interest in testing effects overlaps many research field boundaries, such as education, psychology, public policy, and sociology, and many more subfield boundaries. Second, tests are often developed by private companies and sponsored by governments, neither of which is commonly interested in publishing in scholarly journals. Third, some types of studies simply cannot be found through simple keyword searches. They include proprietary work, audits performed by governmental agencies and program evaluations sponsored by them, most studies conducted during the first two thirds of the twentieth century, and master's theses. Fourth, a large proportion of studies that measured testing effects were focused primarily on other results, and it is those foci and those results that determined which keywords were chosen for the research databases. When testing-effect findings are unplanned or coincidental to a study, a computer search on typical keywords probably will not find them. Fifth, a reliance on keyword search can be inadequate to the task due to variation in keywords across fields; different research disciplines employ different vocabularies. A "net benefit" to an economist, for example, may be called "consequential validity" by psychometricians, "positive effects" by program evaluators, or "positive washback" by education planners.

Searches for this study were limited only by the English language. If one source led to another via citation or reference, that lead was followed. But indexes were used systematically, too. Those indexes included the Education Resources Information Clearinghouse (ERIC); FirstSearch of the Online Computer Library Center (OCLC), which includes PsychLit and hundreds of other social science databases; Dissertation Abstracts; several EBSCO databases; Google; Google Scholar; Yahoo Search; and several individual libraries' catalogues.

In addition to all the aforementioned sources, two lists unique to opinion polls were consulted: the Roper Center Public Opinion Online and Polling the Nations databases.

A complete list of the studies reviewed and their bibliographic references is posted at http://npe.educationnews.org/Review/Resources/**.htm, substituting either "QuantitativeList," "SurveyList," or "QualitativeList" for "**".

Geographic Coverage

No attempt was made to limit the search by geographic region. Naturally, however, limiting the search to English-language documents biases a search in favor of those where the English language is predominant: le monde Anglo-Saxon. Granted, English may be used as the language of scientific communication even in countries where another is spoken. But the search for this study was not restricted to journal articles, and other types of documents, such as government reports and conference papers, are more likely to be written in a home language.

In addition, because the search included any opportunist discoveries made in library catalogues and thesis indexes, and only libraries in the United States were visited, a bias toward United States sources is introduced. Finally, the public opinion poll indexes were limited exclusively to U.S. and Canadian sources.

Overall, 81% of the studies included in the analysis had a primarily U.S. focus, but the geographic coverage varies by methodology type. Whereas 89% of the quantitative studies had a U.S. focus, only 65% of the qualitative studies did. These proportions for the United States might seem disproportionately high, but the United States does, in fact, host over two-thirds of the English-speaking population of the industrialized world.[2]

Perhaps in part due to inclusion of the uniquely North American public opinion poll data, 95% of the survey studies had a U.S. focus. Indeed, there were so few survey studies (N = 5) from outside the United States and Canada that they were dropped from the analysis, making the analysis of survey studies exclusively North American.

Over 3,000 documents were found that potentially contained evidence of testing effects on achievement. Less than one third qualified for inclusion in the analysis, however. The other 2,000 or so were carefully reviewed, found not to contain relevant or sufficient evidence, and set aside.[3] Titles of several hundred more potentially useful sources fill a to-do list of documents waiting to be processed in a potential next stage of this research.

Study Types

Searching uncovered a century's worth of studies measuring the effects of testing-related intervention on student achievement. This broad reach captures quite a variety of studies, including many of the historically most typical (relatively small, focused, randomized experiments with undergraduate psychology classes) and the recently popular (hugely aggregated state- or nationwide multivariate data-crunchings of accountability regimes, with hundreds of independent variables contributing explanation, obfuscation, or both).

Quantitative studies. One hundred seventy-seven quantitative research studies were reviewed, including 640 separate measurements of effects. Some studies report multiple relevant measures of effects (e.g., measured at different times, with different subgroups, under different conditions). All told, the 640 measures comprise information for close to 7 million separate individuals. Quantitative studies directly measure effects and employ, for example, regression analysis, structural equation modeling, pre-post comparison, experimental design, or interrupted time series design.

These quantitative studies manifest a variety of study designs, the most numerous being straightforward experiments (or quasi-experiments) comparing means (or mean gains, or proportions) of a treatment and a control group. For example, an experiment might randomly assign students to two different classrooms, administer a baseline test, and then test the two groups differently throughout a course, say, giving one group a mid-term test and the other group no mid-term test. Then a final test could be administered and the pre-post gain scores compared to determine the effect of giving a mid-term test.[4]

Typically, the studies compared groups (or the same [or similar] group at different time points) facing different tests or test-related situations. One group might have been tested more frequently than the other, tested with higher stakes (i.e., greater consequences), or provided one of two types of feedback: simply being made aware of their test performance or course progress, or given remediation or another type of corrective action based on their test results.

In other studies, the experimental group was told that a test would be counted as part of their grade and the control group was told that it would not. In a few experiments, one group was given take-home tests during the course and the other group in-class tests. On the final exam, the in-class-tested students tended to perform better, perhaps because the take-home-tested students limited their review of the material to the topics on the take-home tests, whereas the in-class-tested students prepared themselves to be tested on all topics.

A small minority of studies employed pre-post comparisons of either the same population (e.g., 4th-graders one year and 4th-graders in the same jurisdiction a later year), the same cohort (e.g., 4th-graders followed longitudinally from one year to a later year), or a synthetic cohort (e.g., a sample of 4th-graders one year compared to a sample of 8th-graders in the same jurisdiction four years later).

Historically, effect-of-testing research has accompanied interest in one or another of the specific types of interventions. Mastery testing studies, for example, were frequent from the 1960s through the 1980s. Accountability studies coincided with the popularity (in the United States, at least) of "minimum competency" testing from the 1970s on. Frequency-of-testing studies were most popular in the first half of the twentieth century. Memory retention studies were frequently conducted then, too, but have also encountered a resurgence in popularity in recent years.

Table 1 describes this collection of quantitative studies by various attributes, including study design and the source, sponsor, and scale of tests used in the studies.

Survey studies. Surveys are a type of quantitative study that measures perceptions of effects, either through public opinion polls or surveys of groups selected to complete a questionnaire as part of a program evaluation. Some survey studies contain multiple relevant items and/or posed relevant questions to multiple respondent groups (e.g., teachers, parents). Two hundred forty-seven survey studies conducted in the United States and Canada were reviewed, summarizing about 700,000 separate individual responses to questions regarding testing's effect on learning and instruction and preferences for common test types.

All told, 813 individual item-response group combinations (hereafter called "items") were identified. For example, if a polling firm posed the same question to two different respondent groups (e.g., teachers, parents), each item-response group combination was counted as a separate item. As was inevitable, some individual respondents are represented more than once, but never for the same item.

Figure 1 breaks out survey items by data source (public opinion poll or survey embedded in a program evaluation) and respondent group (education provider or education consumer). The education provider group comprises administrators, board members, teachers, counselors, and education professors. The education consumer group comprises the public, parents, students, employers, politicians, and tertiary education faculty.

TABLE 1
Source, Sponsor, Study Design, and Scale of Tests in Quantitative Studies

                                    Number of Studies    Percent of Studies
Source of Test
  Researcher or Teacher                    87                   54
  Commercial                               38                   24
  National                                 24                   15
  State or District                        11                    7
  Total                                   160                  100
Sponsor of Test
  Local                                    99                   62
  National                                 45                   28
  State                                    11                    7
  International                             5                    3
  Total                                   160                  100
Study Design
  Experiment / Quasi-experiment           107                   67
  Multivariate                             26                   16
  Pre-post                                 12                    8
  Pre-post (with shadow test)               8                    5
  Experiment, posttest only                 7                    4
  Total                                   160                  100
Scale of Test Administration
  Classroom                               115                   72
  Large-scale                              39                   24
  Mid-scale                                 6                    4
  Total                                   160                  100

Figure 1 reveals a fairly even division of sources among the 800+ survey items, between public opinion polls (55%) and program evaluation surveys (45%). Likewise, an almost even division of respondent group types between education providers (48%) and education consumers (52%) was found.

Among the subtotals, however, asymmetries emerge. Almost five times as many evaluation survey items were posed to education providers as to education consumers. Conversely, almost three times as many public opinion poll items were posed to education consumers as to education providers.

FIGURE 1. Percentage of survey items, by respondent group and type of survey. [Bar chart comparing education providers and education consumers, in percent (vertical axis, 0 to 50), across public opinion polls and program evaluation surveys.]
Note. Three program evaluation survey items were posed to a combined provider-consumer group, and for the purpose of this figure those three items have been counted twice.

Table 2 reveals that a majority of the survey items (62%) concern high-stakes tests (i.e., tests with consequences for students, teachers, or schools, such as retention in grade [for a student] or licensure denial [for a teacher]). This stands to reason, as high-stakes tests are more politically controversial and more familiar to the public, so pollsters and program evaluators are more likely to ask about them.

Table 2 also classifies the survey items by the target of the stakes: the group or groups bearing the consequences of good or poor test performance. For a plurality of items, students bore the consequences (46%). Schools (33%) and teachers (14%) were less often the target of test stakes. In some cases, stakes were applicable to more than one group.

TABLE 2
Number and Percent of Survey Items, By Test Stakes and Their Target Group

                     Number of Survey Items    Percent of Survey Items
Level of Stakes
  High                       507                       62
  Medium                     184                       23
  Unknown                     89                       11
  Low                         33                        4
  Total                      813                      100
Target Group
  Students                   393                       46
  Schools                    281                       33
  Teachers                   116                       14
  No Stakes                   64                        7
  Total                      854                      100

Qualitative studies. Any study for which an effect size could not be computed was designated as qualitative. Typically, qualitative studies declare evidence of effects in categorical terms and incorporate more "hands on" research methods, such as direct observations, site visits, interviews, or case studies. Here, analysis of 244 qualitative studies focuses exclusively on the reported direction of the testing effect on achievement.

To determine whether a qualitative study indicated a positive, neutral, or negative effect of testing on achievement, a specific qualitative meta-analysis approach called template analysis was employed (e.g., Crabtree & Miller, 1999; King, 1998, 2006; Miles & Huberman, 1994). A hierarchical coding rubric was developed to summarize and organize the studies according to themes predetermined as important by previous research and informed by ongoing analysis of the data set. The template establishes broad themes (e.g., stakes) initially; then successively smaller themes (e.g., high, medium, or low) are generated within each broad theme. The template rubric comprised 6 themes and 27 subthemes, identified in Table 3.
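To make the mechanics concrete, the sketch below represents a slice of such a hierarchical template as a small data structure and tallies coded studies by subtheme. It is only an illustration: the theme keys, the tally function, and the example studies are hypothetical, not the authors' actual coding instrument or data.

```python
# A minimal sketch of template-style coding (illustrative names only;
# not the authors' actual instrument or data).
from collections import Counter

TEMPLATE = {
    "stakes_of_test": ["high", "medium", "low", "varies", "not specified"],
    "direction_of_effect": ["positive", "positive inferred", "mixed",
                            "no change", "negative"],
}

def tally(studies, theme):
    """Count coded studies per subtheme for one broad theme."""
    allowed = set(TEMPLATE[theme])
    codes = [study[theme] for study in studies]
    if not all(code in allowed for code in codes):
        raise ValueError("code outside the template rubric")
    return Counter(codes)

# Three hypothetical coded studies.
studies = [
    {"stakes_of_test": "high", "direction_of_effect": "positive"},
    {"stakes_of_test": "low", "direction_of_effect": "positive"},
    {"stakes_of_test": "high", "direction_of_effect": "mixed"},
]
print(tally(studies, "direction_of_effect"))
# Counter({'positive': 2, 'mixed': 1})
```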

All studies were reviewed and coded by three reviewers: a doctoral student in meta-analysis, an editor, and me. Any disparities in coding were discussed among reviewers in an attempt to reach a consensus. In the fewer than 10 cases in which a clear consensus was not reached, the author made the coding decisions.

Qualitative studies used various research methods, and in various combinations. Indeed, 37 of the 245 studies incorporated more than one method (Table 4).

Table 5 categorizes the qualitative studies according to the level of education and the scale, stakes, and target of the stakes of the tests involved. Most (81%) of the studies included in this analysis were large-scale, while 17% of the studies were at the classroom level. Two percent of the studies focused on testing for teachers.

Summary. Table 6 lists the overall numbers of studies and effects by type (quantitative, survey, and qualitative) and geographic coverage (U.S. or non-U.S.). Several studies fit more than one category (e.g., a program evaluation with both survey and observational case study components).

Calculating Effects

For quantitative and survey studies, effect sizes were calculated and the data summarized with weighted and unweighted means. Qualitative studies that reached a conclusion about the effect of testing on student achievement or on instruction were tallied as positive, negative, and in-between.


TABLE 3
Coding Themes and Subthemes Used in Template Analysis

Themes                          Subthemes
Scale of Test                   Large-Scale; Classroom; Test for Teachers; Not Specified
Stakes of Test                  High; Medium; Low; Varies; Not Specified
Target of Stakes                Student; Teacher; School
Research Methods                Case Study; Interview/Focus Group; Journal/Work Log; Records/Document Review; Research Review; Experiment/Pre-Post Comparison for which No Effect Size Was Reported; Survey for which No Effect Size Was Reported
Rigor of Qualitative Study      High; Medium; Low
Direction of Testing Effects    Positive; Positive Inferred; Mixed; No Change; Negative

TABLE 4
Research Method Used in Qualitative Studies

Themes/Subthemes                        Number of Studies    Percent of Studies
Case Study                                    120                  43
Interview (Includes Focus Group)               75                  27
Records or Document Review                     33                  12
Survey                                         22                   8
Experiment or Pre-Post Comparison              21                   7
Research Review                                 8                   3
Journals or Work Logs                           2                   1
Total                                         281                 101

Note. Percentage sums to greater than 100 due to rounding.


TABLE 5
Level of Education, Scale, Stakes, and Target of Stakes in Qualitative Studies

                            Number of Studies    Percent of Studies
Level of Education
  Two or More Levels              115                  50
  Upper Secondary                  55                  24
  Elementary                       34                  15
  Postsecondary                    18                   8
  Lower Secondary                   7                   3
  Adult Education                   3                   1
  Total                           232                 101
Scale of Test
  Large-scale                     196                  81
  Classroom                        42                  17
  Test for Teachers                 5                   2
  Total                           243                 100
Stakes of Test
  High                            154                  63
  Low                              51                  21
  Medium                           33                  14
  Varies/Not Specified              6                   3
  Total                           244                 101
Target of Stakes
  Student                         150                  62
  School                           80                  33
  Teacher                          10                   4
  Varies                            1                  <1
  Total                           241                 100

Note. Percentages may sum to greater than 100 due to rounding.

TABLE 6
Number of Studies and Effects, By Type

Type of          Number of    Number of    Population Coverage    Percent of Studies
Study            Studies      Effects      (in thousands)         with U.S. Focus
Quantitative        177          640            7,000                   89
Surveys & Polls     247          813              700                   95
Qualitative         245          245            unknown                 65
Total               669        1,698


Several formulae are employed to calculate effect sizes, each selected to align to its relevant study design. The most frequently used is some variation of the standardized mean difference (Cohen, 1988):

d = \frac{\bar{X}_T - \bar{X}_C}{s_p}   (1)

where

\bar{X}_T = mean of the treatment group,
\bar{X}_C = mean of the control group,
s_p = pooled standard deviation, and
d = effect size.
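As a concrete illustration, here is a minimal Python sketch of Equation 1. The equation itself does not specify how s_p is pooled; the df-weighted pooled standard deviation below is one common convention, and the sample numbers are invented:

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference (Equation 1)."""
    # Pooled SD: each group's variance weighted by its degrees of freedom
    # (a common convention; Equation 1 leaves the pooling unspecified).
    s_p = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                    / (n_t + n_c - 2))
    return (mean_t - mean_c) / s_p

# Hypothetical experiment: tested group averages 78 on the final exam,
# untested control group averages 72.
print(round(cohens_d(78, 72, 10, 11, 40, 40), 2))  # ~0.57
```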

For those studies that compare proportions rather than means, an effect size comparable to the standardized mean difference effect size is calculated with a log-odds ratio and a logit function transformation (Lipsey & Wilson, 2001, pp. 53–58):

ES = \left( \log_e\!\left[\frac{P_T}{1 - P_T}\right] - \log_e\!\left[\frac{P_C}{1 - P_C}\right] \right) / 1.83   (2)

P_T is the proportion of persons in the treatment group with "successful" outcomes (e.g., passing a test, graduating), and P_C is the proportion of persons in the control group with successful outcomes. The logit function transformation is accomplished by dividing the log-odds ratio by 1.83.
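A minimal sketch of Equation 2, with invented pass rates:

```python
import math

def logodds_es(p_t, p_c):
    """Log-odds-ratio effect size (Equation 2): the difference of the two
    groups' log-odds, divided by 1.83 to place it on the d scale."""
    logit = lambda p: math.log(p / (1 - p))
    return (logit(p_t) - logit(p_c)) / 1.83

# Hypothetical study: 80% of the treatment group passes vs. 65% of controls.
print(round(logodds_es(0.80, 0.65), 2))  # ~0.42
```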

For survey studies, the response pattern for each survey is quantitatively summarized in one of two ways:

• as frequencies for each point along a Likert scale, which can then be summarized by standard measures of central tendency and dispersion; or
• as frequencies for each multiple choice response option, which can then be converted to percentages.

Just over 100 of the 800+ survey items reported their responses as means and standard deviations on a scale. Effects for scale items are calculated as the difference of the response mean from the scale midpoint, divided by the standard deviation.

Close to 700 other survey items with multiple-choice response formats reported the frequency and percentage for each choice. The analysis collapses these multiple choices into dichotomous choices: yes and no, for and against, favorable and unfavorable, agree and disagree, support and oppose, etc. Responses that fit neither side (neutral, no opinion, don't know, no response) are not counted. An effect size is then calculated with the aforementioned log-odds ratio and logit function transformation.
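The sketch below illustrates both survey summaries. The scale-item calculation follows the midpoint rule just described; for the dichotomous items, exactly how the favorable and unfavorable proportions map onto P_T and P_C in Equation 2 is not spelled out here, so treating them as the two proportions is an assumption, as are all the response counts:

```python
import math

def scale_item_es(mean, sd, midpoint):
    """Likert-type item: distance of the response mean from the scale
    midpoint, in standard-deviation units."""
    return (mean - midpoint) / sd

def dichotomous_item_es(n_favor, n_against):
    """Multiple-choice item collapsed to favor/oppose (neutral responses
    dropped), then run through Equation 2. Mapping the two sides onto
    P_T and P_C is an assumption, not the article's stated rule."""
    n = n_favor + n_against
    logit = lambda p: math.log(p / (1 - p))
    return (logit(n_favor / n) - logit(n_against / n)) / 1.83

# Hypothetical items: a 5-point scale item (midpoint 3), and a poll item
# with 830 favorable vs. 120 unfavorable responses (neutrals dropped).
print(round(scale_item_es(3.9, 1.1, 3), 2))     # ~0.82
print(round(dichotomous_item_es(830, 120), 2))  # ~2.11
```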

Weighting. Effect sizes are summarized in two ways: with an unweighted mean, in which case each effect size counts the same regardless of the size of the population under study; and with a weighted mean, in which case effect size measures calculated on larger groups count for more.

For the between 10 and 20 of the quantitative studies (depending on how one aggregates them) that incorporated pre-post comparison designs, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = \frac{2N}{4(1 - r) + d^2}   (3)

where

N = number in study population,
r = correlation coefficient, and
d = effect size.

For the remainder of the studies with standardized mean difference effect sizes, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = \frac{2 N_T N_C (N_T + N_C)}{2(N_T + N_C)^2 + N_T N_C d^2}   (4)

where

N_T = number in treatment group,
N_C = number in control group, and
d = effect size.
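A minimal sketch of both weight formulas, with invented study values:

```python
def weight_prepost(n, r, d):
    """Inverse-variance weight for a pre-post design (Equation 3)."""
    return 2 * n / (4 * (1 - r) + d**2)

def weight_smd(n_t, n_c, d):
    """Inverse-variance weight for a standardized mean difference from
    two independent groups (Equation 4)."""
    return (2 * n_t * n_c * (n_t + n_c)) / (2 * (n_t + n_c)**2 + n_t * n_c * d**2)

# Hypothetical studies; note how the weights grow with sample size.
print(round(weight_prepost(60, 0.7, 0.5), 1))  # ~82.8
print(round(weight_smd(40, 40, 0.57), 1))      # ~19.2
```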

For the group of about 100 survey items with Likert-scale responses, with standardized mean effect sizes, standard errors and weights (i.e., inverse variances) are calculated thus:

SE = \frac{s}{\sqrt{n}}, \qquad w = \frac{n}{s^2}   (5, 6)

where

s = standard deviation, and
n = number of respondents.


For the group of about 700 items with proportional-difference responses, with log-odds-ratio effect sizes, standard errors and weights are calculated thus:

SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \qquad w = \frac{1}{SE^2}   (7, 8)

where

a = number of respondents in favor,
b = number of respondents not in group a,
c = number of respondents against, and
d = number of respondents not in group c.

Once the weights and standard errors are calculated, effect sizes are then "weighted" by the following process (Lipsey & Wilson, 2001, pp. 130–132):

1. Each effect size is multiplied by its weight: w_i \times es_i.
2. The weighted mean effect size is calculated thus:

\overline{ES} = \frac{\sum w_i \, es_i}{\sum w_i}   (9)
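Putting Equations 7-9 together, here is a minimal sketch of the pooling step for proportion-based survey items; the three items and all of their cell counts are invented:

```python
import math

def weight_logodds(a, b, c, d):
    """Weight for a log-odds effect size (Equations 7-8): the inverse of
    the squared standard error computed from the four cell counts."""
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return 1 / se**2

def weighted_mean_es(effect_sizes, weights):
    """Weighted mean effect size (Equation 9)."""
    return sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)

# Hypothetical pooling of three item-level effect sizes.
effects = [2.11, 1.40, 0.60]
weights = [weight_logodds(830, 120, 120, 830),
           weight_logodds(400, 250, 250, 400),
           weight_logodds(300, 280, 280, 300)]
print(round(weighted_mean_es(effects, weights), 2))  # ~1.30
```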

Standards for judging effect sizes. Cohen (1988) proposed that a d < 0.20 represented a small effect and a d > 0.80 represented a large effect. A value for d in between, then, represented a medium effect. In his meta-analysis of meta-analyses, however, Hattie (2009) placed the average d for all meta-analyses of student achievement effect studies near 0.40, with a pseudo-Hawthorne Effect base level around 0.10 (i.e., the level at which any achievement-related intervention not producing a clear positive or negative effect will settle).

RESULTS

Quantitative Studies

For quantitative studies, simple and weighted effect sizes are calculated for several different aggregations (e.g., same treatment group, control group, or study author). Mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement. Adding stakes or testing with greater frequency also strongly and positively affects achievement.

Studies are classified with dozens of moderators, including size of study population, scale of test administration, responsible jurisdiction (e.g., nation, state, classroom), level of education, and primary focus of study (e.g., memory retention, testing frequency, mastery testing, accountability program).

For the full panoply of 640 different effects, the "bare bones" mean d ≈ 0.55. After adjustments for small sample sizes (Hedges, 1981) and measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004, chapter 8), the mean d ≈ 0.71, an apparently stronger effect.[5]

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group, or content with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group with differing content or outcome measures (N ≈ 430) produces a mean d ≈ 0.75. Grouping studies by the study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated or received some other type of corrective action based on their test performance produced a mean d ≈ 0.96 (see Table 7).

Note that these treatments, such as testing more frequently or with higher stakes, need not be mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions (all with strong effect sizes) but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment Group ...                                            Mean Effect Size
... is made aware of performance and control group is not           0.98
... receives targeted instruction (e.g., remediation)               0.96
... is tested with higher stakes than control group                 0.87
... is tested more frequently than control group                    0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                    Number of Studies    Mean Effect Size
Source of Test
  Researcher or Teacher                    87                 0.93
  National                                 24                 0.87
  Commercial                               38                 0.82
  State or District                        11                 0.72
  Total                                   160
Sponsor of Test
  International                             5                 1.02
  Local                                    99                 0.93
  National                                 45                 0.81
  State                                    11                 0.64
  Total                                   160
Study Design
  Pre-post                                 12                 0.97
  Experiment / Quasi-experiment           107                 0.94
  Multivariate                             26                 0.80
  Experiment, posttest only                 7                 0.60
  Pre-post (with shadow test)               8                 0.58
  Total                                   160
Scale of Analysis
  Aggregated                                9                 1.60
  Small-scale                             118                 0.91
  Large-scale                              33                 0.57
  Total                                   160
Scale of Test Administration
  Classroom                               115                 0.95
  Mid-scale                                 6                 0.72
  Large-scale                              39                 0.71
  Total                                   160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies, such as experiments, produce an average effect size of 0.91, whereas large-population studies, such as those using national populations, produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 1.1 million), and their effect sizes tend to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed: for all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study-population quintile compares to its weighted counterpart of 0.28. For the smallest study-population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.

Now consider a question posed to teachers in the Canadian Province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remaining were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large: in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes, weighted or unweighted, for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly, strongly and positively, to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large -0.5.

TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958-2008 [1]

                                           Total                    Education Providers [2]      Education Consumers [3]
                                     ES      ES     No. of        ES      ES     No. of        ES      ES     No. of
Item Type                          (Unwt)  (Wt)    Items        (Unwt)  (Wt)    Items        (Unwt)  (Wt)    Items
Testing Improves Learning            1.3    0.5     242           1.2    0.5     144           1.3    0.6      98
Testing Improves Instruction         1.0    0.7     165           1.0    0.6     125           1.2    1.0      40
Testing Aligns Instruction           0.6    0.0      28
Favor Grade Promotion Exams          1.3    1.1      81           1.2    1.0      26           1.3    1.1      53
Favor Graduation Exams               1.4    1.1     121           1.2    0.8      38           1.4    1.2      81
Favor More Testing                   0.5    0.7      72          -0.2    0.1      19           0.7    0.8      53
Favor Teacher Competency Exams       2.1    0.8      61           1.1    0.3      17           2.5    2.2      44
Favor Teacher Recertification Exams  2.1    1.5      22                                        2.5    2.0      18
Hold Teachers Accountable for Scores -0.1   0.0      31          -0.6   -0.5      18           0.6    0.5      13
Hold Schools Accountable for Scores  0.6    0.5     111           0.2   -0.3      38           0.9    0.7      73

Notes. 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect    Number of Studies    Percent of Studies    Percent w/o Inferred
Positive                     204                   84                    93
Positive Inferred             24                   10
No Change                      8                    3                     4
Mixed                          5                    2                     2
Negative                       3                    1                     1
Total                        244                  100                   100

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such should be difficult to believe after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114-117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies. Expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

[1] Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
[2] In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
[3] Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
[4] The actual effect size was calculated, for example, comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
[5] One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manual: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 3: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 23

program evaluations sponsored by them most studies conducted during the firsttwo thirds of the twentieth century and masterrsquos theses Fourth a large proportionof studies that measured testing effects were focused primarily on other resultsand it is those foci and those results that determined which keywords were cho-sen for the research data bases When testing effect findings are unplanned orcoincidental to a study a computer search on typical keywords probably will notfind it Fifth a reliance on keyword search can be inadequate to the task due tovariation in keywords across fields different research disciplines employ differentvocabularies A ldquonet benefitrdquo to an economist for example may be called ldquoconse-quential validityrdquo by psychometricians ldquopositive effectsrdquo by program evaluatorsor ldquopositive washbackrdquo by education planners

Searches for this study were limited only by the English language If onesource led to another via citation or reference that lead was followed But indexeswere used systematically too Those indexes included the Education ResourcesInformation Clearinghouse (ERIC) FirstSearch of the Online Computer LibraryCenter (OCLC) which includes PsychLit and hundreds of other social sciencedata bases Dissertation Abstracts several EBSCO data bases Google GoogleScholar Yahoo Search and several individual librariesrsquo catalogues

In addition to all the aforementioned sources two lists unique to opinion pollswere consulted the Roper Center Public Opinion Online and Polling the Nationsdatabases

A complete list of the studies reviewed and their bibliographic referencesis posted at httpnpeeducationnewsorgReviewResourceslowastlowasthtm substitutingeither ldquoQuantitativeListrdquo ldquoSurveyListrdquo or ldquoQualitativeListrdquo for ldquolowastlowastrdquo

Geographic Coverage

No attempt was made to limit the search by geographic region Naturally howeverlimiting the search to English-language documents biases a search in favor of thosewhere the English language is predominantmdashle monde Anglo-Saxon GrantedEnglish may be used as the language of scientific communication even in countrieswhere another is spoken But the search for this study was not restricted to journalarticles and other types of documents such as government reports and conferencepapers are more likely to be written in a home language

In addition because the search included any opportunist discoveries madein library catalogues and thesis indexes and only libraries in the United Stateswere visited a bias toward United States sources is introduced Finally the publicopinion poll indexes were limited exclusively to US and Canadian sources

Overall 81 of the studies included in the analysis had a primarily US focusbut the geographic coverage varies by methodology type Whereas 89 of thequantitative studies had a US focus only 65 of the qualitative studies did Theseproportions for the United States might seem disproportionately high but the

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

24 PHELPS

United States does in fact host over two-thirds of the English-speaking populationof the industrialized world2

Perhaps in part due to inclusion of the uniquely North American public opinionpoll data 95 of the survey studies had a US focus Indeed there were so fewsurvey studies (N = 5) from outside the United States and Canada they weredropped from the analysis making the analysis of survey studies exclusivelyNorth American

Over 3000 documents were found that potentially contained evidence of testingeffects on achievement Less than one third qualified for inclusion in the analysishowever The other 2000 or so were carefully reviewed found not to containrelevant or sufficient evidence and set aside3 Titles of several hundred morepotentially useful sources fill a do-list of documents waiting to be processed in apotential next stage of this research

Study Types

Searching uncovered a centuryrsquos worth of studies measuring the effects of testing-related intervention on student achievement This broad reach captures quite a va-riety of studies including many of the historically most typicalmdashrelatively smallfocused randomized experiments with undergraduate psychology classesmdashandthe recently popularmdashhugely aggregated state- or nationwide multivariate data-crunchings of accountability regimes with hundreds of independent variables con-tributing explanation obfuscation or both

Quantitative studies One hundred seventy-seven quantitative researchstudies were reviewed that include 640 separate measurements of effects Somestudies report multiple relevant measures of effects (eg measured at differenttimes with different subgroups under different conditions) All told the 640measures comprise information for close to 7 million separate individuals Quan-titative studies directly measure effects and employ for example regression anal-ysis structural equation modeling pre-post comparison experimental design orinterrupted time series design

These quantitative studies manifest a variety of study designs the most numer-ous being straightforward experiments (or quasi-experiments) comparing means(or mean gains or proportions) of a treatment and a control group For exam-ple an experiment might randomly assign students to two different classroomsadminister a baseline test and then test the two groups differently throughout acourse say giving one group a mid-term test and the other group no mid-term testThen a final test could be administered and the pre-post gain scores compared todetermine the effect of giving a mid-term test4

Typically the studies compared groups (or the same [or similar] group at differ-ent time points) facing different tests or test-related situations One group might

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 25

have been tested more frequently than the other tested with higher stakes (iegreater consequences) or provided one of two types of feedbackmdashsimply beingmade aware of their test performance or course progress or given remediation oranother type of corrective action based on their test results

In other studies the experimental group was told that a test would be countedas part of their grade and the control group was told that it would not In a fewexperiments one group was given take-home tests during the course and theother group in-class tests On the final exam the in-class tested students tended toperform better perhaps because the take-homendashtested students limited their reviewof the material to the topics on the take-home tests whereas the in-classndashtestedstudents prepared themselves to be tested on all topics

A small minority of studies employed pre-post comparisons of either the same population (e.g., 4th-graders one year and 4th-graders in the same jurisdiction a later year), the same cohort (e.g., 4th-graders followed longitudinally from one year to a later year), or a synthetic cohort (e.g., a sample of 4th-graders one year compared to a sample of 8th-graders in the same jurisdiction four years later).

Historically, effect-of-testing research has accompanied interest in one or another of the specific types of interventions. Mastery testing studies, for example, were frequent from the 1960s through the 1980s. Accountability studies coincided with the popularity (in the United States, at least) of "minimum competency" testing from the 1970s on. Frequency-of-testing studies were most popular in the first half of the twentieth century. Memory retention studies were frequently conducted then, too, but have also enjoyed a resurgence in popularity in recent years.

Table 1 describes this collection of quantitative studies by various attributes, including study design and the source, sponsor, and scale of tests used in the studies.

Survey studies. Surveys are a type of quantitative study that measures perceptions of effects, either through public opinion polls or surveys of groups selected to complete a questionnaire as part of a program evaluation. Some survey studies contain multiple relevant items and/or posed relevant questions to multiple respondent groups (e.g., teachers, parents). Two hundred forty-seven survey studies conducted in the United States and Canada were reviewed, summarizing about 700,000 separate individual responses to questions regarding testing's effect on learning and instruction and preferences for common test types.

All told, 813 individual item-response group combinations (hereafter called "items") were identified. For example, if a polling firm posed the same question to two different respondent groups (e.g., teachers, parents), each item-response group combination was counted as a separate item. As was inevitable, some individual respondents are represented more than once, but never for the same item.

TABLE 1
Source, Sponsor, Study Design, and Scale of Tests in Quantitative Studies

                                   Number of Studies   Percent of Studies
Source of Test
  Researcher or Teacher                   87                  54
  Commercial                              38                  24
  National                                24                  15
  State or District                       11                   7
  Total                                  160                 100
Sponsor of Test
  Local                                   99                  62
  National                                45                  28
  State                                   11                   7
  International                            5                   3
  Total                                  160                 100
Study Design
  Experiment / Quasi-experiment          107                  67
  Multivariate                            26                  16
  Pre-post                                12                   8
  Pre-post (with shadow test)              8                   5
  Experiment: Posttest only                7                   4
  Total                                  160                 100
Scale of Test Administration
  Classroom                              115                  72
  Large-scale                             39                  24
  Mid-scale                                6                   4
  Total                                  160                 100

Figure 1 breaks out survey items by data source (public opinion poll or survey embedded in a program evaluation) and respondent group (education provider or education consumer). The education provider group comprises administrators, board members, teachers, counselors, and education professors. The education consumer group comprises the public, parents, students, employers, politicians, and tertiary education faculty.

Figure 1 reveals a fairly even division of sources among the 800+ survey items between public opinion polls (55%) and program evaluation surveys (45%). Likewise, an almost even division of respondent group types between education providers (48%) and education consumers (52%) was found.

Among the subtotals, however, asymmetries emerge. Almost five times as many evaluation survey items were posed to education providers as to education consumers. Conversely, almost three times as many public opinion poll items were posed to education consumers as to education providers.

[Figure: bar chart. Vertical axis: Percent (0 to 50). Categories: Public Opinion Polls; Program Evaluation Surveys. Series: Education Providers; Education Consumers.]

FIGURE 1. Percentage of survey items, by respondent group and type of survey.
Note: *Three program evaluation survey items were posed to a combined provider-consumer group, and for the purpose of this figure, those three items have been counted twice.

Table 2 reveals that a majority of the survey items (62%) concern high-stakes tests (i.e., tests with consequences for students, teachers, or schools, such as retention in grade [for a student] or licensure denial [for a teacher]). This stands to reason, as high-stakes tests are more politically controversial and more familiar to the public, so pollsters and program evaluators are more likely to ask about them.

TABLE 2
Number and Percent of Survey Items, By Test Stakes and Their Target Group

                      Number of Survey Items   Percent of Survey Items
Level of Stakes
  High                         507                      62
  Medium                       184                      23
  Unknown                       89                      11
  Low                           33                       4
  Total                        813                     100
Target Group
  Students                     393                      46
  Schools                      281                      33
  Teachers                     116                      14
  No Stakes                     64                       7
  Total                        854                     100

Table 2 also classifies the survey items by the target of the stakes: the group or groups bearing the consequences of good or poor test performance. For a plurality of items, students bore the consequences (46%). Schools (33%) and teachers (14%) were less often the target of test stakes. In some cases, stakes were applicable to more than one group.

Qualitative studies. Any study for which an effect size could not be computed was designated as qualitative. Typically, qualitative studies declare evidence of effects in categorical terms and incorporate more "hands-on" research methods, such as direct observations, site visits, interviews, or case studies. Here, analysis of 244 qualitative studies focuses exclusively on the reported direction of the testing effect on achievement.

To determine whether a qualitative study indicated a positive, neutral, or negative effect of testing on achievement, a specific qualitative meta-analysis approach called template analysis was employed (e.g., Crabtree & Miller, 1999; King, 1998, 2006; Miles & Huberman, 1994). A hierarchical coding rubric was developed to summarize and organize the studies according to themes predetermined as important by previous research and informed by ongoing analysis of the data set. The template establishes broad themes (e.g., stakes) initially; then successively smaller themes (e.g., high, medium, or low) are generated within each broad theme. The template rubric comprised 6 themes and 27 subthemes, identified in Table 3.
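To make the mechanics concrete, a hierarchical template of this kind can be represented as a simple nested lookup. The sketch below is illustrative only: the theme and subtheme names come from Table 3, but the data layout and the validation helper are assumptions of this sketch, not the coding instrument actually used.

```python
# Illustrative coding template: each broad theme maps to its allowable
# subthemes (names taken from Table 3; layout is assumed, not the authors').
TEMPLATE = {
    "scale_of_test": ["large-scale", "classroom", "test for teachers", "not specified"],
    "stakes_of_test": ["high", "medium", "low", "varies", "not specified"],
    "target_of_stakes": ["student", "teacher", "school"],
    "direction_of_effect": ["positive", "positive inferred", "mixed",
                            "no change", "negative"],
}

def code_study(codes: dict) -> dict:
    """Check one reviewer's codes for a study against the template."""
    for theme, subtheme in codes.items():
        if subtheme not in TEMPLATE.get(theme, []):
            raise ValueError(f"{subtheme!r} is not a valid subtheme of {theme!r}")
    return codes

# A single qualitative study, coded on three of the six themes.
study = code_study({
    "scale_of_test": "large-scale",
    "stakes_of_test": "high",
    "direction_of_effect": "positive",
})
```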

All studies were reviewed and coded by three reviewers: a doctoral student in meta-analysis, an editor, and me. Any disparities in coding were discussed among the reviewers in an attempt to reach a consensus. In the fewer than 10 cases in which a clear consensus was not reached, the author made the coding decisions.

Qualitative studies used various research methods, and in various combinations. Indeed, 37 of the 245 studies incorporated more than one method (Table 4).

Table 5 categorizes the qualitative studies according to the level of education and the scale, stakes, and target of the stakes of the tests involved. Most (81%) of the studies included in this analysis were large-scale, while 17% of the studies were at the classroom level. Two percent of the studies focused on testing for teachers.

Summary. Table 6 lists the overall numbers of studies and effects by type (quantitative, survey, and qualitative) and geographic coverage (U.S. or non-U.S.). Several studies fit more than one category (e.g., a program evaluation with both survey and observational case study components).

Calculating Effects

For quantitative and survey studies, effect sizes were calculated and the data summarized with weighted and unweighted means. Qualitative studies that reached a conclusion about the effect of testing on student achievement or on instruction were tallied as positive, negative, or in-between.


TABLE 3
Coding Themes and Subthemes Used in Template Analysis

Themes                         Subthemes
Scale of Test                  • Large-Scale
                               • Classroom
                               • Test for Teachers
                               • Not Specified
Stakes of Test                 • High
                               • Medium
                               • Low
                               • Varies
                               • Not Specified
Target of Stakes               • Student
                               • Teacher
                               • School
Research Methods               • Case Study
                               • Interview/Focus Group
                               • Journal/Work Log
                               • Records/Document Review
                               • Research Review
                               • Experiment/Pre-Post Comparison for which No Effect Size Was Reported
                               • Survey for which No Effect Size Was Reported
Rigor of Qualitative Study     • High
                               • Medium
                               • Low
Direction of Testing Effects   • Positive
                               • Positive Inferred
                               • Mixed
                               • No Change
                               • Negative

TABLE 4
Research Method Used in Qualitative Studies

Themes/Subthemes                        Number of Studies   Percent of Studies
Case Study                                    120                 43
Interview (Includes Focus Group)               75                 27
Records or Document Review                     33                 12
Survey                                         22                  8
Experiment or Pre-Post Comparison              21                  7
Research Review                                 8                  3
Journals or Work Logs                           2                  1
Total                                         281                101

Note: Percentage sums to greater than 100% due to rounding.


TABLE 5
Level of Education, Scale, Stakes, and Target of Stakes in Qualitative Studies

                          Number of Studies   Percent of Studies
Level of Education
  Two or More Levels            115                 50
  Upper Secondary                55                 24
  Elementary                     34                 15
  Postsecondary                  18                  8
  Lower Secondary                 7                  3
  Adult Education                 3                  1
  Total                         232                101
Scale of Test
  Large-scale                   196                 81
  Classroom                      42                 17
  Test for Teachers               5                  2
  Total                         243                100
Stakes of Test
  High                          154                 63
  Low                            51                 21
  Medium                         33                 14
  Varies/Not Specified            6                  3
  Total                         244                101
Target of Stakes
  Student                       150                 62
  School                         80                 33
  Teacher                        10                  4
  Varies                          1                 <1
  Total                         241                100

Note: Percentages may sum to greater than 100% due to rounding.

TABLE 6
Number of Studies and Effects, By Type

Type of           Number of   Number of   Population Coverage   Percent of Studies
Study             Studies     Effects     (in thousands)        with U.S. Focus
Quantitative         177         640           7,000                  89
Surveys & Polls      247         813             700                  95
Qualitative          245         245           unknown                65
Total                669       1,698


Several formulae are employed to calculate effect sizes, each selected to align with its relevant study design. The most frequently used is some variation of the standardized mean difference (Cohen, 1988):

    d = (X̄_T − X̄_C) / s_p    (1)

where

X̄_T = mean of the treatment group,
X̄_C = mean of the control group,
s_p = pooled standard deviation, and
d = effect size.

For those studies that compare proportions rather than means, an effect size comparable to the standardized mean difference effect size is calculated with a log-odds ratio and a logit function transformation (Lipsey & Wilson, 2001, pp. 53–58):

    ES = (log_e[P_T / (1 − P_T)] − log_e[P_C / (1 − P_C)]) / 1.83    (2)

P_T is the proportion of persons in the treatment group with "successful" outcomes (e.g., passing a test, graduating), and P_C is the proportion of persons in the control group with successful outcomes. The logit function transformation is accomplished by dividing the log-odds ratio by 1.83.
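In code, the two formulas are short. The sketch below assumes the usual two-group pooled standard deviation for Equation 1, since the article does not spell out how s_p was computed; the function names are mine, not the article's.

```python
import math

def smd_effect_size(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference, Equation 1: d = (mean_T - mean_C) / s_p."""
    # Pooled standard deviation of the treatment and control groups (assumed form).
    s_p = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / s_p

def logodds_effect_size(p_t, p_c):
    """Log-odds-ratio effect size, Equation 2, rescaled to the d metric."""
    log_odds = math.log(p_t / (1 - p_t)) - math.log(p_c / (1 - p_c))
    return log_odds / 1.83  # logit transformation to a d-comparable scale

# Example: 78% of a tested group graduate versus 64% of an untested group.
print(round(logodds_effect_size(0.78, 0.64), 2))  # ≈ 0.38
```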

For survey studies, the response pattern for each survey is quantitatively summarized in one of two ways:

• as frequencies for each point along a Likert scale, which can then be summarized by standard measures of central tendency and dispersion; or
• as frequencies for each multiple-choice response option, which can then be converted to percentages.

Just over 100 of the 800+ survey items reported their responses as means and standard deviations on a scale. Effects for scale items are calculated as the difference of the response mean from the scale midpoint, divided by the standard deviation.

Close to 700 other survey items with multiple-choice response formats reported the frequency and percentage for each choice. The analysis collapses these multiple choices into dichotomous choices: yes and no, for and against, favorable and unfavorable, agree and disagree, support and oppose, etc. Responses that fit neither side (neutral, no opinion, don't know, no response) are not counted. An effect size is then calculated with the aforementioned log-odds ratio and logit function transformation.
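A sketch of both survey-item calculations follows. For the dichotomous case it assumes that the favorable and unfavorable proportions play the roles of P_T and P_C in Equation 2 (consistent with the a, b, c, d definitions given with Equations 7 and 8 below); again, the function names are illustrative.

```python
import math

def scale_item_effect(mean, sd, scale_min, scale_max):
    """Likert-scale item: distance of the response mean from the scale
    midpoint, in standard deviation units."""
    midpoint = (scale_min + scale_max) / 2
    return (mean - midpoint) / sd

def dichotomous_item_effect(pct_favorable, pct_unfavorable):
    """Collapsed multiple-choice item, scored with the Equation 2 log-odds
    ratio; neutral and no-opinion responses are simply not counted."""
    p_for, p_against = pct_favorable / 100, pct_unfavorable / 100
    log_odds = math.log(p_for / (1 - p_for)) - math.log(p_against / (1 - p_against))
    return log_odds / 1.83

# A 5-point scale item with mean 3.8 and SD 0.9, and an item with 83%
# favorable / 12% unfavorable responses (matching the North Carolina item
# discussed in the Results section).
print(round(scale_item_effect(3.8, 0.9, 1, 5), 2))   # ≈ 0.89
print(round(dichotomous_item_effect(83, 12), 2))     # ≈ 1.96
```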

Weighting. Effect sizes are summarized in two ways: with an unweighted mean, in which case each effect size counts the same regardless of the size of the population under study, and with a weighted mean, in which case effect size measures calculated on larger groups count for more.

For the 10% to 20% of the quantitative studies (depending on how one aggregates them) that incorporated pre-post comparison designs, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

    w = 2N / [4(1 − r) + d²]    (3)

where

N = number in the study population,
r = correlation coefficient, and
d = effect size.

For the remainder of the studies, with standardized mean difference effect sizes, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

    w = [2 × N_T × N_C × (N_T + N_C)] / [2(N_T + N_C)² + N_T × N_C × d²]    (4)

where

N_T = number in treatment group,
N_C = number in control group, and
d = effect size.

For the group of about 100 survey items with Likert-scale responses, with standardized mean effect sizes, standard errors and weights (i.e., inverse variances) are calculated thus:

    SE = s / √n    (5)
    w = n / s²    (6)

where

s = standard deviation and
n = number of respondents.


For the group of about 700 items with proportional-difference responses, with log-odds-ratio effect sizes, standard errors and weights are calculated thus:

    w = 1 / SE²    (7)
    SE = √(1/a + 1/b + 1/c + 1/d)    (8)

where

a = number of respondents in favor,
b = number of respondents not in group a,
c = number of respondents against, and
d = number of respondents not in group c.

Once the weights and standard errors are calculated, effect sizes are then "weighted" by the following process (Lipsey & Wilson, 2001, pp. 130–132):

1. Each effect size is multiplied by its weight, w × es.
2. The weighted mean effect size is calculated as follows:

    ES = Σ(w_i × es_i) / Σw_i    (9)
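Pulling Equations 3, 4, 7, 8, and 9 together, a minimal sketch of the pooling step might look like the following; the two example studies and their numbers are invented for illustration.

```python
def weight_smd(n_t, n_c, d):
    """Inverse-variance weight for a standardized mean difference (Equation 4)."""
    return (2 * n_t * n_c * (n_t + n_c)) / (2 * (n_t + n_c) ** 2 + n_t * n_c * d ** 2)

def weight_prepost(n, r, d):
    """Inverse-variance weight for a pre-post comparison design (Equation 3)."""
    return (2 * n) / (4 * (1 - r) + d ** 2)

def weight_logodds(a, b, c, d):
    """Inverse-variance weight for a log-odds-ratio item (Equations 7 and 8)."""
    se_squared = 1 / a + 1 / b + 1 / c + 1 / d
    return 1 / se_squared

def weighted_mean_es(effects, weights):
    """Weighted mean effect size (Equation 9)."""
    return sum(w * es for w, es in zip(weights, effects)) / sum(weights)

# Two hypothetical experiments: a small one with a large effect and a
# large one with a small effect; the weighted mean leans toward the latter.
effects = [0.90, 0.35]
weights = [weight_smd(25, 25, 0.90), weight_smd(400, 400, 0.35)]
print(round(weighted_mean_es(effects, weights), 2))  # ≈ 0.38
```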

Standards for judging effect sizes. Cohen (1988) proposed that a d < 0.20 represented a small effect and a d > 0.80 represented a large effect. A value for d in between, then, represented a medium effect. In his meta-analysis of meta-analyses, however, Hattie (2009) placed the average d for all meta-analyses of student achievement effect studies near 0.40, with a pseudo-Hawthorne Effect base level around 0.10 (i.e., the level at which any achievement-related intervention not producing a clear positive or negative effect will settle).
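For readers who want the two sets of benchmarks side by side, a small, purely illustrative helper (not anything from the cited sources) could encode both readings:

```python
def label_effect(d, benchmark="cohen"):
    """Interpret d under Cohen's (1988) cutoffs, or relative to Hattie's
    (2009) 0.40 average and 0.10 pseudo-Hawthorne base level."""
    if benchmark == "cohen":
        if d < 0.20:
            return "small"
        return "large" if d > 0.80 else "medium"
    # Hattie-style reading for achievement interventions.
    if d <= 0.10:
        return "at or below the pseudo-Hawthorne base level"
    return "above average" if d > 0.40 else "below average"

print(label_effect(0.55))            # medium
print(label_effect(0.55, "hattie"))  # above average
```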

RESULTS

Quantitative Studies

For quantitative studies, simple and weighted effect sizes are calculated for several different aggregations (e.g., same treatment group, control group, or study author). Mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement. Adding stakes or testing with greater frequency also strongly and positively affects achievement.

Studies are classified with dozens of moderators, including size of study population, scale of test administration, responsible jurisdiction (e.g., nation, state, classroom), level of education, and primary focus of study (e.g., memory retention, testing frequency, mastery testing, accountability program).

For the full panoply of 640 different effects, the "bare bones" mean d ≈ 0.55. After adjustments for small sample sizes (Hedges, 1981) and measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004, chapter 8), the mean d ≈ 0.71, an apparently stronger effect.5
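Neither adjustment formula is reproduced in the article, but the standard forms from the cited sources are simple: Hedges' (1981) small-sample correction shrinks d slightly, and the Hunter-Schmidt (2004) attenuation correction divides d by the square root of the outcome measure's reliability. A sketch, assuming a two-group study of 30 + 30 students and an outcome reliability of 0.80 (both invented numbers):

```python
import math

def hedges_correction(d, n_t, n_c):
    """Hedges' (1981) small-sample correction for the SMD."""
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))

def attenuation_correction(d, reliability):
    """Hunter & Schmidt (2004): correct for measurement error in the
    dependent variable by dividing by the square root of its reliability."""
    return d / math.sqrt(reliability)

d = 0.55
adjusted = attenuation_correction(hedges_correction(d, 30, 30), 0.80)
print(round(adjusted, 2))  # ≈ 0.61 for this single hypothetical study
```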

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group or content, with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group, with differing content or outcome measures (N ≈ 430), produces a mean d ≈ 0.75. Grouping studies by study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated or received some other type of corrective action based on their test performance produced a mean d ≈ 0.96. (See Table 7.)

Note that these treatments, such as testing more frequently or with higher stakes, need not be mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions (all with strong effect sizes) but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment Group ...                                             Mean Effect Size
... is made aware of performance (and control group is not)         0.98
... receives targeted instruction (e.g., remediation)               0.96
... is tested with higher stakes than control group                 0.87
... is tested more frequently than control group                    0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                   Number of Studies   Mean Effect Size
Source of Test
  Researcher or Teacher                   87                0.93
  National                                24                0.87
  Commercial                              38                0.82
  State or District                       11                0.72
  Total                                  160
Sponsor of Test
  International                            5                1.02
  Local                                   99                0.93
  National                                45                0.81
  State                                   11                0.64
  Total                                  160
Study Design
  Pre-post                                12                0.97
  Experiment / Quasi-experiment          107                0.94
  Multivariate                            26                0.80
  Experiment: Posttest only                7                0.60
  Pre-post (with shadow test)              8                0.58
  Total                                  160
Scale of Analysis
  Aggregated                               9                1.60
  Small-scale                            118                0.91
  Large-scale                             33                0.57
  Total                                  160
Scale of Test Administration
  Classroom                              115                0.95
  Mid-scale                                6                0.72
  Large-scale                             39                0.71
  Total                                  160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies such as experiments produce an average effect size of 0.91, whereas large-population studies such as those using national populations produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 1.1 million), and effect size tends to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study-population quintile compares to its weighted counterpart of 0.28. For the smallest study-population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often, however, the meaning of two different questions or prompts is very similar even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remainder were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large: in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes for these four item groups, weighted or unweighted, are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly, strongly and positively, to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958–2008 (1)

(Each cell: effect size unweighted / effect size weighted / number of items)

Item Type                                   Total          Education Providers (2)   Education Consumers (3)
Testing Improves Learning              1.3 / 0.5 / 242        1.2 / 0.5 / 144           1.3 / 0.6 / 98
Testing Improves Instruction           1.0 / 0.7 / 165        1.0 / 0.6 / 125           1.2 / 1.0 / 40
Testing Aligns Instruction             0.6 / 0.0 / 28
Favor Grade Promotion Exams            1.3 / 1.1 / 81         1.2 / 1.0 / 26            1.3 / 1.1 / 53
Favor Graduation Exams                 1.4 / 1.1 / 121        1.2 / 0.8 / 38            1.4 / 1.2 / 81
Favor More Testing                     0.5 / 0.7 / 72        -0.2 / 0.1 / 19            0.7 / 0.8 / 53
Favor Teacher Competency Exams         2.1 / 0.8 / 61         1.1 / 0.3 / 17            2.5 / 2.2 / 44
Favor Teacher Recertification Exams    2.1 / 1.5 / 22                                   2.5 / 2.0 / 18
Hold Teachers Accountable for Scores  -0.1 / 0.0 / 31        -0.6 / -0.5 / 18           0.6 / 0.5 / 13
Hold Schools Accountable for Scores    0.6 / 0.5 / 111        0.2 / -0.3 / 38           0.9 / 0.7 / 73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large -0.5.

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect    Number of Studies   Percent of Studies   Percent w/o Inferred
Positive                     204                  84                   93
Positive Inferred             24                  10
No Change                      8                   3                    4
Mixed                          5                   2                    2
Negative                       3                   1                    1
Total                        244                 100                  100

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies, and very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such should be difficult to believe after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which, in turn, produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes and that they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some (and probably most) educational testing must be large-scale. So, can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Indeed, even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Public opinion polls sometimes represent the best available method for learning the public's wishes in democracies; expensive, highly controlled and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or one might admit that other quantitative studies cannot possibly capture all the factors that matter and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.
Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.
King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.
King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm
Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.
Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.
Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.


Page 4: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

24 PHELPS

United States does in fact host over two-thirds of the English-speaking populationof the industrialized world2

Perhaps in part due to inclusion of the uniquely North American public opinionpoll data 95 of the survey studies had a US focus Indeed there were so fewsurvey studies (N = 5) from outside the United States and Canada they weredropped from the analysis making the analysis of survey studies exclusivelyNorth American

Over 3000 documents were found that potentially contained evidence of testingeffects on achievement Less than one third qualified for inclusion in the analysishowever The other 2000 or so were carefully reviewed found not to containrelevant or sufficient evidence and set aside3 Titles of several hundred morepotentially useful sources fill a do-list of documents waiting to be processed in apotential next stage of this research

Study Types

Searching uncovered a centuryrsquos worth of studies measuring the effects of testing-related intervention on student achievement This broad reach captures quite a va-riety of studies including many of the historically most typicalmdashrelatively smallfocused randomized experiments with undergraduate psychology classesmdashandthe recently popularmdashhugely aggregated state- or nationwide multivariate data-crunchings of accountability regimes with hundreds of independent variables con-tributing explanation obfuscation or both

Quantitative studies One hundred seventy-seven quantitative researchstudies were reviewed that include 640 separate measurements of effects Somestudies report multiple relevant measures of effects (eg measured at differenttimes with different subgroups under different conditions) All told the 640measures comprise information for close to 7 million separate individuals Quan-titative studies directly measure effects and employ for example regression anal-ysis structural equation modeling pre-post comparison experimental design orinterrupted time series design

These quantitative studies manifest a variety of study designs the most numer-ous being straightforward experiments (or quasi-experiments) comparing means(or mean gains or proportions) of a treatment and a control group For exam-ple an experiment might randomly assign students to two different classroomsadminister a baseline test and then test the two groups differently throughout acourse say giving one group a mid-term test and the other group no mid-term testThen a final test could be administered and the pre-post gain scores compared todetermine the effect of giving a mid-term test4

Typically the studies compared groups (or the same [or similar] group at differ-ent time points) facing different tests or test-related situations One group might

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 25

have been tested more frequently than the other tested with higher stakes (iegreater consequences) or provided one of two types of feedbackmdashsimply beingmade aware of their test performance or course progress or given remediation oranother type of corrective action based on their test results

In other studies the experimental group was told that a test would be countedas part of their grade and the control group was told that it would not In a fewexperiments one group was given take-home tests during the course and theother group in-class tests On the final exam the in-class tested students tended toperform better perhaps because the take-homendashtested students limited their reviewof the material to the topics on the take-home tests whereas the in-classndashtestedstudents prepared themselves to be tested on all topics

A small minority of studies employed pre-post comparisons of either the samepopulation (eg 4th-graders one year and 4th-graders in the same jurisdiction alater year) the same cohort (eg 4th-graders followed longitudinally from oneyear to a later year) or a synthetic cohort (eg a sample of 4th-graders one yearcompared to a sample of 8th-graders in the same jurisdiction four years later)

Historically effect-of-testing research has accompanied interest in one or an-other of the specific types of interventions Mastery testing studies for examplewere frequent from the 1960s through the 1980s Accountability studies coincidedwith the popularity (in the United States at least) of ldquominimum competencyrdquo test-ing from the 1970s on Frequency-of-testing studies were most popular in the firsthalf of the twentieth century Memory retention studies were frequently conductedthen too but have also encountered resurgence in popularity in recent years

Table 1 describes this collection of quantitative studies by various attributesincluding study design and the source sponsor and scale of tests used in thestudies

Survey studies Surveys are a type of quantitative study that measures per-ceptions of effectsmdasheither through public opinion polls or surveys of groupsselected to complete a questionnaire as part of a program evaluation Some surveystudies contain multiple relevant items andor posed relevant questions to multiplerespondent groups (eg teachers parents) Two hundred forty-seven survey stud-ies conducted in the United States and Canada were reviewed summarizing about700000 separate individual responses to questions regarding testingrsquos effect onlearning and instruction and preferences for common test types

All told 813 individual item-response group combinations (hereafter calledldquoitemsrdquo) were identified For example if a polling firm posed the same question totwo different respondent groups (eg teachers parents) each item-response groupcombination was counted as a separate item As was inevitable some individualrespondents are represented more than once but never for the same item

Figure 1 breaks out survey items by data sourcemdashpublic opinion poll or surveyembedded in a program evaluationmdashand respondent groupmdasheducation provider

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

26 PHELPS

TABLE 1Source Sponsor Study Design and Scale of Tests in Quantitative Studies

Number of Studies Percent of Studies

Source of TestResearcher or Teacher 87 54Commercial 38 24National 24 15State or District 11 7Total 160 100

Sponsor of TestLocal 99 62National 45 28State 11 7International 5 3Total 160 100

Study DesignExperiment Quasi-experiment 107 67Multivariate 26 16Pre-post 12 8Pre-post (with shadow test) 8 5Experiment Posttest only 7 4Total 160 100

Scale of Test AdministrationClassroom 115 72Large-scale 39 24Mid-scale 6 4Total 160 100

or education consumer The education provider group comprises administratorsboard members teachers counselors and education professors The educationconsumer group comprises the public parents students employers politiciansand tertiary education faculty

Figure 1 reveals a fairly even division of sources among the 800+ surveyitems between public opinion polls (55) and program evaluation surveys (45)Likewise an almost even division of respondent group types between educationproviders (48) and education consumers (52) was found

Among the subtotals however asymmetries emerge Almost five times asmany evaluation survey items were posed to education providers than to educationconsumers Conversely almost three times as many public opinion poll items wereposed to education consumers than to education providers

Table 2 reveals that a majority of the survey items (62) concern high-stakestests (ie tests with consequences for students teachers or schools such asretention in grade [for a student] or licensure denial [for a teacher]) This stands to

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 27

05

10

1520253035

404550

Public Opinion Polls Program EvaluationSurveys

Per

cen

t EducationProviders

EducationConsumers

FIGURE 1Percentage of survey items By respondent group and type of survey

Note lowastThree program evaluation survey items were posed to a combinedprovider-consumer group and for the purpose of this figure those three items have been

counted twice

reason as high-stakes tests are more politically controversial and more familiar tothe public so pollsters and program evaluators are more likely to ask about them

Table 2 also classifies the survey items by the target of the stakesmdashthe groupor groups bearing the consequences of good or poor test performance For aplurality of items students bore the consequences (46) Schools (33) and

TABLE 2Number and Percent of Survey Items By Test Stakes and Their Target Group

Number of Survey Items Percent of Survey Items

Level of StakesHigh 507 62Medium 184 23Unknown 89 11Low 33 4Total 813 100

Target GroupStudents 393 46Schools 281 33Teachers 116 14No Stakes 64 7Total 854 100

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

28 PHELPS

teachers (14) were less often the target of test stakes In some cases stakes wereapplicable to more than one group

Qualitative studies Any study for which an effect size could not be com-puted was designated as qualitative Typically qualitative studies declare evidenceof effects in categorical terms and incorporate more ldquohands onrdquo research methodssuch as direct observations site visits interviews or case studies Here analysis of244 qualitative studies focuses exclusively on the reported direction of the testingeffect on achievement

To determine whether a qualitative study indicated a positive neutral or nega-tive effect of testing on achievement a specific qualitative meta-analysis approachcalled template analysis was employed (eg Crabtree amp Miller 1999 King 19982006 Miles amp Huberman 1994) A hierarchical coding rubric was developed tosummarize and organize the studies according to themes predetermined as im-portant by previous research and informed by ongoing analysis of the data setThe template establishes broad themes (eg stakes) initially then successivelysmaller themes (eg high medium or low) are generated within each broadtheme The template rubric comprised 6 themes and 27 subthemes identified inTable 3

All studies were reviewed and coded by three reviewers a doctoral student inmeta-analysis an editor and me Any disparities in coding were discussed amongreviewers in an attempt to reach a consensus In less than 10 cases in which a clearconsensus was not reached the author made the coding decisions

Qualitative studies used various research methods and in various combinationsIndeed 37 of the 245 studies incorporated more than one method (Table 4)

Table 5 categorizes the qualitative studies according to the level of educationand the scale stakes and target of the stakes of the tests involved Most (81)studies included in this analysis were large-scale while 17 of the studies wereat the classroom level Two percent of the studies focused on testing for teachers

Summary Table 6 lists the overall numbers of studies and effects bytypemdashquantitative survey and qualitativemdashand geographic coveragemdashUS ornon-US Several studies fit more than one category (eg a program evaluationwith both survey and observational case study components)

Calculating Effects

For quantitative and survey studies effect sizes were calculated and the datasummarized with weighted and unweighted means Qualitative studies that reacheda conclusion about the effect of testing on student achievement or on instructionwere tallied as positive negative and in-between

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 29

TABLE 3Coding Themes and Subthemes Used in Template Analysis

Themes Subthemes

Scale of Test bull Large-Scalebull Classroombull Test for Teachersbull Not Specified

Stakes of Test bull Highbull Mediumbull Lowbull Variesbull Not Specified

Target of Stakes bull Studentbull Teacherbull School

Research Methods bull Case Studybull InterviewFocus Groupbull JournalWork Logbull RecordsDocument Reviewbull Research Reviewbull ExperimentPre-Post Comparison for which No

Effect Size Was Reportedbull Survey for which No Effect Size Was Reported

Rigor of Qualitative Study bull Highbull Mediumbull Low

Direction of Testing Effects bull Positivebull Positive Inferredbull Mixedbull No Changebull Negative

TABLE 4Research Method Used in Qualitative Studies

ThemesSubthemes Number of Studies Percent of Studies

Case Study 120 43Interview (Includes Focus Group) 75 27Records or Document Review 33 12Survey 22 8Experiment or Pre-Post Comparison 21 7Research Review 8 3Journals or Work Logs 2 1Total 281 101

Note Percentage sums to greater than 100 due to rounding

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

30 PHELPS

TABLE 5Level of Education Scale Stakes and Target of Stakes in Qualitative Studies

Number of Studies Percent of Studies

Level of EducationTwo or More Levels 115 50Upper Secondary 55 24Elementary 34 15Postsecondary 18 8Lower Secondary 7 3Adult Education 3 1Total 232 101

Scale of TestLarge-scale 196 81Classroom 42 17Test for Teachers 5 2Total 243 100

Stakes of TestHigh 154 63Low 51 21Medium 33 14VariesNot Specified 6 3Total 244 101

Target of StakesStudent 150 62School 80 33Teacher 10 4Varies 1 lt1Total 241 100

Note Percentages may sum to greater than 100 due to rounding

TABLE 6Number of Studies and Effects By Type

Type of Number of Number of Population Coverage Percent of StudiesStudy Studies Effects (in thousands) with US Focus

Quantitative 177 640 7000 89Surveys amp Polls 247 813 700 95Qualitative 245 245 unknown 65Total 669 1698

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reported the frequency and percentage for each choice. The analysis collapses these multiple choices into dichotomous choices: yes and no, for and against, favorable and unfavorable, agree and disagree, support and oppose, and so on. Responses that fit neither side (neutral, no opinion, don't know, no response) are not counted.


An effect size is then calculated with the aforementioned log-odds ratio and logit function transformation.
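A short sketch of both survey-item conversions, under the same caveat (illustrative names; the dichotomous case reuses log_odds_effect_size from the sketch above, after neutral responses are dropped):

def scale_item_effect_size(mean, midpoint, sd):
    # Likert-scale item: distance of the response mean from the scale midpoint, in SD units
    return (mean - midpoint) / sd

def dichotomous_item_effect_size(n_favor, n_against):
    # Collapsed multiple-choice item; neutral and no-opinion responses are excluded
    n = n_favor + n_against
    return log_odds_effect_size(n_favor / n, n_against / n)

# Example: a 5-point item (midpoint 3.0) with response mean 3.8 and SD 1.1
print(round(scale_item_effect_size(3.8, 3.0, 1.1), 2))  # ≈ 0.73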

Weighting. Effect sizes are summarized in two ways: with an unweighted mean, in which case each effect size counts the same regardless of the size of the population under study, and with a weighted mean, in which case effect size measures calculated on larger groups count for more.

For between 10 and 20 of the quantitative studies (depending on how one aggregates them) that incorporated pre-post comparison designs, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

$$ w = \frac{2N}{4(1 - r) + d^2} \qquad (3) $$

where

N = number in study population,
r = correlation coefficient, and
d = effect size.

For the remainder of the studies with standardized mean difference effect sizes, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

$$ w = \frac{2 \, N_T \, N_C \, (N_T + N_C)}{2(N_T + N_C)^2 + N_T \, N_C \, d^2} \qquad (4) $$

where

N_T = number in treatment group,
N_C = number in control group, and
d = effect size.

For the group of about 100 survey items with Likert-scale responses, with standardized mean effect sizes, standard errors and weights (i.e., inverse variances) are calculated thus:

$$ SE = \frac{s}{\sqrt{n}}, \qquad w = \frac{n}{s^2} \qquad (5,\ 6) $$

where

s = standard deviation, and
n = number of respondents.


For the group of about 700 items with proportional-difference responses, with log-odds-ratio effect sizes, standard errors and weights are calculated thus:

$$ w = \frac{1}{SE^2}, \qquad SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \qquad (7,\ 8) $$

where

a = number of respondents in favor,
b = number of respondents not in group a,
c = number of respondents against, and
d = number of respondents not in group c.

Once the weights and standard errors are calculated, effect sizes are then "weighted" by the following process (Lipsey & Wilson, 2001, pp. 130–132):

1. Each effect size is multiplied by its weight, w × es.
2. The weighted mean effect size is calculated thus:

$$ \overline{ES} = \frac{\sum w_i \, es_i}{\sum w_i} \qquad (9) $$
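A sketch of the weighting step under the same assumptions (illustrative function names; Equations (3), (4), (7, 8), and (9) as reconstructed above):

def weight_pre_post(n, r, d):
    # Equation (3): inverse-variance weight for a pre-post comparison design
    return 2 * n / (4 * (1 - r) + d**2)

def weight_two_group(n_t, n_c, d):
    # Equation (4): inverse-variance weight for a treatment/control mean difference
    return (2 * n_t * n_c * (n_t + n_c)) / (2 * (n_t + n_c)**2 + n_t * n_c * d**2)

def weight_log_odds(a, b, c, d):
    # Equations (7, 8): w = 1 / SE^2, with SE^2 = 1/a + 1/b + 1/c + 1/d
    return 1 / (1/a + 1/b + 1/c + 1/d)

def weighted_mean_es(effect_sizes, weights):
    # Equation (9): sum of weighted effect sizes divided by the sum of weights
    return sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)

# Example: three studies; the largest dominates the weighted mean
es = [0.9, 0.6, 0.3]
w = [weight_two_group(20, 20, 0.9), weight_two_group(50, 50, 0.6),
     weight_two_group(500, 500, 0.3)]
print(round(weighted_mean_es(es, w), 2))  # ≈ 0.35, pulled toward the largest study

The example anticipates the pattern reported below: weighting pulls mean effect sizes toward the results from the largest study populations.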

Standards for judging effect sizes. Cohen (1988) proposed that a d < 0.20 represented a small effect and a d > 0.80 represented a large effect. A value for d in between, then, represented a medium effect. In his meta-analysis of meta-analyses, however, Hattie (2009) placed the average d for all meta-analyses of student achievement effect studies near 0.40, with a pseudo-Hawthorne Effect base level around 0.10 (i.e., the level at which any achievement-related intervention not producing a clear positive or negative effect will settle).
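Expressed as a small helper (our own convention, simply encoding the thresholds just quoted):

def cohen_label(d):
    # Cohen (1988): d < 0.20 is small, d > 0.80 is large, otherwise medium
    magnitude = abs(d)
    if magnitude < 0.20:
        return "small"
    if magnitude > 0.80:
        return "large"
    return "medium"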

RESULTS

Quantitative Studies

For quantitative studies, simple and weighted effect sizes are calculated for several different aggregations (e.g., same treatment group, control group, or study author). Mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement.


Adding stakes or testing with greater frequency also strongly and positively affects achievement.

Studies are classified with dozens of moderators, including size of study population, scale of test administration, responsible jurisdiction (e.g., nation, state, classroom), level of education, and primary focus of study (e.g., memory retention, testing frequency, mastery testing, accountability program).

For the full panoply of 640 different effects, the "bare bones" mean d ≈ 0.55. After adjustments for small sample sizes (Hedges, 1981) and measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004, chapter 8), the mean d ≈ 0.71, an apparently stronger effect (see Note 5).
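The article does not print the adjustment formulas, so the following is a hedged sketch of the textbook versions of the two corrections it cites (Hedges' small-sample correction and the Hunter–Schmidt correction for unreliability in the outcome measure; the reliability value in the example is an assumption):

import math

def hedges_correction(d, n_t, n_c):
    # Hedges (1981): g = d * (1 - 3 / (4*df - 1)), with df = n_T + n_C - 2
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))

def unreliability_correction(d, r_yy):
    # Hunter & Schmidt (2004, ch. 8): divide by the square root of outcome reliability
    return d / math.sqrt(r_yy)

# Example: d = 0.55 from a 30-person study, assumed outcome reliability 0.80
g = hedges_correction(0.55, 15, 15)                 # ≈ 0.54
print(round(unreliability_correction(g, 0.80), 2))  # ≈ 0.60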

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group or content, with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group, with differing content or outcome measures (N ≈ 430), produces a mean d ≈ 0.75. Grouping studies by the study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated or received some other type of corrective action based on their test performance produced a mean d ≈ 0.96. (See Table 7.)

Note that these treatments, such as testing more frequently or with higher stakes, need not be mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions, all of which have strong effect sizes, but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment group . . .                                      Mean Effect Size
is made aware of performance and control group is not           0.98
receives targeted instruction (e.g., remediation)               0.96
is tested with higher stakes than control group                 0.87
is tested more frequently than control group                    0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                Number of Studies   Mean Effect Size
Source of Test
  Researcher or Teacher                87                0.93
  National                             24                0.87
  Commercial                           38                0.82
  State or District                    11                0.72
  Total                               160
Sponsor of Test
  International                         5                1.02
  Local                                99                0.93
  National                             45                0.81
  State                                11                0.64
  Total                               160
Study Design
  Pre-post                             12                0.97
  Experiment: Quasi-experiment        107                0.94
  Multivariate                         26                0.80
  Experiment: Posttest only             7                0.60
  Pre-post (with shadow test)           8                0.58
  Total                               160
Scale of Analysis
  Aggregated                            9                1.60
  Small-scale                         118                0.91
  Large-scale                          33                0.57
  Total                               160
Scale of Test Administration
  Classroom                           115                0.95
  Mid-scale                             6                0.72
  Large-scale                          39                0.71
  Total                               160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies such as experiments produce an average effect size of 0.91, whereas large-population studies, such as those using national populations, produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76), which are often large-scale.


Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40), for example, when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination. Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 1.1 million), and their effect sizes tend to correlate negatively with it. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study-population quintile compares to its weighted counterpart of 0.28. For the smallest study-population quintile, the effect size is 1.04 unweighted, but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.
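Run through the dichotomous conversion sketched earlier (an illustration of the method, not the article's printed arithmetic), this North Carolina item yields a large positive effect, consistent with the report below that basic-effects items often exceed 1.0:

# 83% favorable, 12% unfavorable; the 5% neutral responses are dropped
print(round(dichotomous_item_effect_size(83, 12), 2))  # ≈ 2.11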

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remaining were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency


exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large, in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes, weighted or unweighted, for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly, strongly and positively, to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive, but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large −0.5.


TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958–2008 (Note 1)

Columns within each group: Effect Size (Unweighted), Effect Size (Weighted), Number of Items.

                                            Total              Education Providers (Note 2)  Education Consumers (Note 3)
Item Type                           Unwt.   Wt.    Items        Unwt.   Wt.    Items          Unwt.   Wt.    Items
Testing Improves Learning            1.3    0.5     242          1.2    0.5     144            1.3    0.6      98
Testing Improves Instruction         1.0    0.7     165          1.0    0.6     125            1.2    1.0      40
Testing Aligns Instruction           0.6    0.0      28
Favor Grade Promotion Exams          1.3    1.1      81          1.2    1.0      26            1.3    1.1      53
Favor Graduation Exams               1.4    1.1     121          1.2    0.8      38            1.4    1.2      81
Favor More Testing                   0.5    0.7      72         −0.2    0.1      19            0.7    0.8      53
Favor Teacher Competency Exams       2.1    0.8      61          1.1    0.3      17            2.5    2.2      44
Favor Teacher Recertification Exams  2.1    1.5      22                                        2.5    2.0      18
Hold Teachers Accountable for Scores −0.1   0.0      31         −0.6   −0.5      18            0.6    0.5      13
Hold Schools Accountable for Scores  0.6    0.5     111          0.2   −0.3      38            0.9    0.7      73

Notes. 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.


TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect   Number of Studies   Percent of Studies   Percent w/o Inferred
Positive                     204                  84                    93
Positive Inferred             24                  10
No Change                      8                   3                     4
Mixed                          5                   2                     2
Negative                       3                   1                     1
Total                        244                 100                   100

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies, and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact.


Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such claims should be difficult to believe after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which, in turn, produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions, but rather use as many of them as is practical at the same time.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this.


First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis, but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be large scale. So, can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies. Expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies.


A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective-schools or curricular-alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.


Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.


Page 5: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 25

have been tested more frequently than the other tested with higher stakes (iegreater consequences) or provided one of two types of feedbackmdashsimply beingmade aware of their test performance or course progress or given remediation oranother type of corrective action based on their test results

In other studies the experimental group was told that a test would be countedas part of their grade and the control group was told that it would not In a fewexperiments one group was given take-home tests during the course and theother group in-class tests On the final exam the in-class tested students tended toperform better perhaps because the take-homendashtested students limited their reviewof the material to the topics on the take-home tests whereas the in-classndashtestedstudents prepared themselves to be tested on all topics

A small minority of studies employed pre-post comparisons of either the samepopulation (eg 4th-graders one year and 4th-graders in the same jurisdiction alater year) the same cohort (eg 4th-graders followed longitudinally from oneyear to a later year) or a synthetic cohort (eg a sample of 4th-graders one yearcompared to a sample of 8th-graders in the same jurisdiction four years later)

Historically effect-of-testing research has accompanied interest in one or an-other of the specific types of interventions Mastery testing studies for examplewere frequent from the 1960s through the 1980s Accountability studies coincidedwith the popularity (in the United States at least) of ldquominimum competencyrdquo test-ing from the 1970s on Frequency-of-testing studies were most popular in the firsthalf of the twentieth century Memory retention studies were frequently conductedthen too but have also encountered resurgence in popularity in recent years

Table 1 describes this collection of quantitative studies by various attributesincluding study design and the source sponsor and scale of tests used in thestudies

Survey studies Surveys are a type of quantitative study that measures per-ceptions of effectsmdasheither through public opinion polls or surveys of groupsselected to complete a questionnaire as part of a program evaluation Some surveystudies contain multiple relevant items andor posed relevant questions to multiplerespondent groups (eg teachers parents) Two hundred forty-seven survey stud-ies conducted in the United States and Canada were reviewed summarizing about700000 separate individual responses to questions regarding testingrsquos effect onlearning and instruction and preferences for common test types

All told 813 individual item-response group combinations (hereafter calledldquoitemsrdquo) were identified For example if a polling firm posed the same question totwo different respondent groups (eg teachers parents) each item-response groupcombination was counted as a separate item As was inevitable some individualrespondents are represented more than once but never for the same item

Figure 1 breaks out survey items by data sourcemdashpublic opinion poll or surveyembedded in a program evaluationmdashand respondent groupmdasheducation provider

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

26 PHELPS

TABLE 1Source Sponsor Study Design and Scale of Tests in Quantitative Studies

Number of Studies Percent of Studies

Source of TestResearcher or Teacher 87 54Commercial 38 24National 24 15State or District 11 7Total 160 100

Sponsor of TestLocal 99 62National 45 28State 11 7International 5 3Total 160 100

Study DesignExperiment Quasi-experiment 107 67Multivariate 26 16Pre-post 12 8Pre-post (with shadow test) 8 5Experiment Posttest only 7 4Total 160 100

Scale of Test AdministrationClassroom 115 72Large-scale 39 24Mid-scale 6 4Total 160 100

or education consumer The education provider group comprises administratorsboard members teachers counselors and education professors The educationconsumer group comprises the public parents students employers politiciansand tertiary education faculty

Figure 1 reveals a fairly even division of sources among the 800+ surveyitems between public opinion polls (55) and program evaluation surveys (45)Likewise an almost even division of respondent group types between educationproviders (48) and education consumers (52) was found

Among the subtotals however asymmetries emerge Almost five times asmany evaluation survey items were posed to education providers than to educationconsumers Conversely almost three times as many public opinion poll items wereposed to education consumers than to education providers

Table 2 reveals that a majority of the survey items (62) concern high-stakestests (ie tests with consequences for students teachers or schools such asretention in grade [for a student] or licensure denial [for a teacher]) This stands to

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 27

05

10

1520253035

404550

Public Opinion Polls Program EvaluationSurveys

Per

cen

t EducationProviders

EducationConsumers

FIGURE 1Percentage of survey items By respondent group and type of survey

Note lowastThree program evaluation survey items were posed to a combinedprovider-consumer group and for the purpose of this figure those three items have been

counted twice

reason as high-stakes tests are more politically controversial and more familiar tothe public so pollsters and program evaluators are more likely to ask about them

Table 2 also classifies the survey items by the target of the stakesmdashthe groupor groups bearing the consequences of good or poor test performance For aplurality of items students bore the consequences (46) Schools (33) and

TABLE 2Number and Percent of Survey Items By Test Stakes and Their Target Group

Number of Survey Items Percent of Survey Items

Level of StakesHigh 507 62Medium 184 23Unknown 89 11Low 33 4Total 813 100

Target GroupStudents 393 46Schools 281 33Teachers 116 14No Stakes 64 7Total 854 100

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

28 PHELPS

teachers (14) were less often the target of test stakes In some cases stakes wereapplicable to more than one group

Qualitative studies Any study for which an effect size could not be com-puted was designated as qualitative Typically qualitative studies declare evidenceof effects in categorical terms and incorporate more ldquohands onrdquo research methodssuch as direct observations site visits interviews or case studies Here analysis of244 qualitative studies focuses exclusively on the reported direction of the testingeffect on achievement

To determine whether a qualitative study indicated a positive neutral or nega-tive effect of testing on achievement a specific qualitative meta-analysis approachcalled template analysis was employed (eg Crabtree amp Miller 1999 King 19982006 Miles amp Huberman 1994) A hierarchical coding rubric was developed tosummarize and organize the studies according to themes predetermined as im-portant by previous research and informed by ongoing analysis of the data setThe template establishes broad themes (eg stakes) initially then successivelysmaller themes (eg high medium or low) are generated within each broadtheme The template rubric comprised 6 themes and 27 subthemes identified inTable 3

All studies were reviewed and coded by three reviewers a doctoral student inmeta-analysis an editor and me Any disparities in coding were discussed amongreviewers in an attempt to reach a consensus In less than 10 cases in which a clearconsensus was not reached the author made the coding decisions

Qualitative studies used various research methods and in various combinationsIndeed 37 of the 245 studies incorporated more than one method (Table 4)

Table 5 categorizes the qualitative studies according to the level of educationand the scale stakes and target of the stakes of the tests involved Most (81)studies included in this analysis were large-scale while 17 of the studies wereat the classroom level Two percent of the studies focused on testing for teachers

Summary Table 6 lists the overall numbers of studies and effects bytypemdashquantitative survey and qualitativemdashand geographic coveragemdashUS ornon-US Several studies fit more than one category (eg a program evaluationwith both survey and observational case study components)

Calculating Effects

For quantitative and survey studies effect sizes were calculated and the datasummarized with weighted and unweighted means Qualitative studies that reacheda conclusion about the effect of testing on student achievement or on instructionwere tallied as positive negative and in-between

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 29

TABLE 3Coding Themes and Subthemes Used in Template Analysis

Themes Subthemes

Scale of Test bull Large-Scalebull Classroombull Test for Teachersbull Not Specified

Stakes of Test bull Highbull Mediumbull Lowbull Variesbull Not Specified

Target of Stakes bull Studentbull Teacherbull School

Research Methods bull Case Studybull InterviewFocus Groupbull JournalWork Logbull RecordsDocument Reviewbull Research Reviewbull ExperimentPre-Post Comparison for which No

Effect Size Was Reportedbull Survey for which No Effect Size Was Reported

Rigor of Qualitative Study bull Highbull Mediumbull Low

Direction of Testing Effects bull Positivebull Positive Inferredbull Mixedbull No Changebull Negative

TABLE 4Research Method Used in Qualitative Studies

ThemesSubthemes Number of Studies Percent of Studies

Case Study 120 43Interview (Includes Focus Group) 75 27Records or Document Review 33 12Survey 22 8Experiment or Pre-Post Comparison 21 7Research Review 8 3Journals or Work Logs 2 1Total 281 101

Note Percentage sums to greater than 100 due to rounding

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

30 PHELPS

TABLE 5Level of Education Scale Stakes and Target of Stakes in Qualitative Studies

Number of Studies Percent of Studies

Level of EducationTwo or More Levels 115 50Upper Secondary 55 24Elementary 34 15Postsecondary 18 8Lower Secondary 7 3Adult Education 3 1Total 232 101

Scale of TestLarge-scale 196 81Classroom 42 17Test for Teachers 5 2Total 243 100

Stakes of TestHigh 154 63Low 51 21Medium 33 14VariesNot Specified 6 3Total 244 101

Target of StakesStudent 150 62School 80 33Teacher 10 4Varies 1 lt1Total 241 100

Note Percentages may sum to greater than 100 due to rounding

TABLE 6Number of Studies and Effects By Type

Type of Number of Number of Population Coverage Percent of StudiesStudy Studies Effects (in thousands) with US Focus

Quantitative 177 640 7000 89Surveys amp Polls 247 813 700 95Qualitative 245 245 unknown 65Total 669 1698

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reportedthe frequency and percentage for each choice The analysis collapses these multiplechoices into dichotomous choicesmdashyes and no for and against favorable andunfavorable agree and disagree support and oppose etc Responses that fit neithersidemdashneutral no opinion donrsquot know no responsemdashare not counted An effect

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TABLE 9
Effect Sizes of Survey Responses, By Item Group, for the Total Population, Education Providers, and Education Consumers, 1958–2008 (Note 1)

                                           Total                     Education Providers (2)    Education Consumers (3)
    Item Type                         ES       ES      No. of      ES       ES      No. of      ES       ES      No. of
                                    (Unwt)    (Wt)     Items     (Unwt)    (Wt)     Items     (Unwt)    (Wt)     Items
    Testing Improves Learning         1.3      0.5      242        1.2      0.5      144        1.3      0.6       98
    Testing Improves Instruction      1.0      0.7      165        1.0      0.6      125        1.2      1.0       40
    Testing Aligns Instruction        0.6      0.0       28
    Favor Grade Promotion Exams       1.3      1.1       81        1.2      1.0       26        1.3      1.1       53
    Favor Graduation Exams            1.4      1.1      121        1.2      0.8       38        1.4      1.2       81
    Favor More Testing                0.5      0.7       72       −0.2      0.1       19        0.7      0.8       53
    Favor Teacher Competency Exams    2.1      0.8       61        1.1      0.3       17        2.5      2.2       44
    Favor Teacher Recert. Exams       2.1      1.5       22                                     2.5      2.0       18
    Hold Teachers Acct. for Scores   −0.1      0.0       31       −0.6     −0.5       18        0.6      0.5       13
    Hold Schools Acct. for Scores     0.6      0.5      111        0.2     −0.3       38        0.9      0.7       73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.




Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported “[using] results from test to improve coursework.” If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the “positive inferred” studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

    Direction of Effect    Number of Studies    Percent of Studies    Percent w/o Inferred
    Positive                     204                   84                     93
    Positive Inferred             24                   10
    No Change                      8                    3                      4
    Mixed                          5                    2                      2
    Negative                       3                    1                      1
    Total                        244                  100                    100
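The last column of Table 10 simply drops the 24 inferred-positive studies from the denominator. A quick check of the arithmetic, using the counts from the table, reproduces the printed percentages:

    # Counts from Table 10; "Percent w/o Inferred" drops the 24
    # inferred-positive studies from the denominator.
    counts = {"Positive": 204, "Positive Inferred": 24,
              "No Change": 8, "Mixed": 5, "Negative": 3}
    total = sum(counts.values())                    # 244
    total_wo = total - counts["Positive Inferred"]  # 220

    for direction, n in counts.items():
        line = f"{direction}: {100 * n / total:.0f}%"
        if direction != "Positive Inferred":
            line += f" ({100 * n / total_wo:.0f}% w/o inferred)"
        print(line)

Running this gives 204/220 ≈ 93% positive without the inferred studies, matching the figure cited in the text.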

DISCUSSION

One hundred years’ evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited,



or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as “teaching to the test,” “narrowing the curriculum,” and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of “consequential validity” measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The “paucity of research” belief has spread widely among researchers of all types and ideological persuasions.

Such beliefs should be difficult to maintain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the “top” effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations



for this. First, larger studies often contain more “noise”: information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of “test-based accountability” had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of “test-based accountability.” A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So, can planners incorporate the “small-scale” benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents’ preferences. And in democratic societies, preferences matter. Indeed, even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public’s preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public’s wishes in democracies. Expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes



as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter and so might underestimate the positive effect of testing.

The “bottom line” result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: “narrowing the curriculum” and “teaching to the test” are commonly heard phrases. But when the content domains of a test match a jurisdiction’s required content standards, aligning a course of study to the test is eminently responsible behavior.

2. In 2007, the United States’ population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.

3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This “off focus” research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), “opt-out” tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing’s effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.

4. The actual effect size was calculated, for example, by comparing two groups’ posttest scores (using pretests or other covariates if assignment was not random), two groups’ post-pre-test gain scores, or the same group’s posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.

5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any “test-retest” reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.
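Note 5 concerns the artifact adjustments cited in the Results (Hedges, 1981; Hunter & Schmidt, 2004). The article does not reproduce those formulas, so the sketch below uses the standard textbook versions: Hedges’ small-sample correction factor and division of d by the square root of the outcome measure’s reliability. The sample sizes and reliability value are invented for illustration.

    import math

    def hedges_correction(d: float, n_treat: int, n_ctrl: int) -> float:
        # Standard small-sample bias correction (Hedges, 1981):
        # g = J * d, with J ~= 1 - 3 / (4 * df - 1), df = n_t + n_c - 2.
        df = n_treat + n_ctrl - 2
        j = 1 - 3 / (4 * df - 1)
        return j * d

    def reliability_correction(d: float, r_yy: float) -> float:
        # Attenuation correction for measurement error in the dependent
        # variable (Hunter & Schmidt, 2004): divide by sqrt(reliability).
        return d / math.sqrt(r_yy)

    # Illustrative values only: a raw d of 0.55 from a small study,
    # with an assumed outcome-measure reliability of 0.80.
    d = hedges_correction(0.55, n_treat=20, n_ctrl=20)
    d = reliability_correction(d, r_yy=0.80)
    print(f"adjusted d = {d:.2f}")  # 0.55 -> roughly 0.60

The two corrections pull in opposite directions: the small-sample correction shrinks d slightly, while the reliability correction inflates it, which is consistent with the adjusted means in the Results exceeding the “bare bones” means.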

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manual: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.

Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing’s achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.




Page 7: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 27

05

10

1520253035

404550

Public Opinion Polls Program EvaluationSurveys

Per

cen

t EducationProviders

EducationConsumers

FIGURE 1Percentage of survey items By respondent group and type of survey

Note lowastThree program evaluation survey items were posed to a combinedprovider-consumer group and for the purpose of this figure those three items have been

counted twice

reason as high-stakes tests are more politically controversial and more familiar tothe public so pollsters and program evaluators are more likely to ask about them

Table 2 also classifies the survey items by the target of the stakesmdashthe groupor groups bearing the consequences of good or poor test performance For aplurality of items students bore the consequences (46) Schools (33) and

TABLE 2Number and Percent of Survey Items By Test Stakes and Their Target Group

Number of Survey Items Percent of Survey Items

Level of StakesHigh 507 62Medium 184 23Unknown 89 11Low 33 4Total 813 100

Target GroupStudents 393 46Schools 281 33Teachers 116 14No Stakes 64 7Total 854 100

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

28 PHELPS

teachers (14) were less often the target of test stakes In some cases stakes wereapplicable to more than one group

Qualitative studies Any study for which an effect size could not be com-puted was designated as qualitative Typically qualitative studies declare evidenceof effects in categorical terms and incorporate more ldquohands onrdquo research methodssuch as direct observations site visits interviews or case studies Here analysis of244 qualitative studies focuses exclusively on the reported direction of the testingeffect on achievement

To determine whether a qualitative study indicated a positive neutral or nega-tive effect of testing on achievement a specific qualitative meta-analysis approachcalled template analysis was employed (eg Crabtree amp Miller 1999 King 19982006 Miles amp Huberman 1994) A hierarchical coding rubric was developed tosummarize and organize the studies according to themes predetermined as im-portant by previous research and informed by ongoing analysis of the data setThe template establishes broad themes (eg stakes) initially then successivelysmaller themes (eg high medium or low) are generated within each broadtheme The template rubric comprised 6 themes and 27 subthemes identified inTable 3

All studies were reviewed and coded by three reviewers a doctoral student inmeta-analysis an editor and me Any disparities in coding were discussed amongreviewers in an attempt to reach a consensus In less than 10 cases in which a clearconsensus was not reached the author made the coding decisions

Qualitative studies used various research methods and in various combinationsIndeed 37 of the 245 studies incorporated more than one method (Table 4)

Table 5 categorizes the qualitative studies according to the level of education and the scale, stakes, and target of the stakes of the tests involved. Most (81%) of the studies included in this analysis were large-scale, while 17% of the studies were at the classroom level. Two percent of the studies focused on testing for teachers.

Summary. Table 6 lists the overall numbers of studies and effects by type (quantitative, survey, and qualitative) and geographic coverage (U.S. or non-U.S.). Several studies fit more than one category (e.g., a program evaluation with both survey and observational case study components).

Calculating Effects

For quantitative and survey studies, effect sizes were calculated and the data summarized with weighted and unweighted means. Qualitative studies that reached a conclusion about the effect of testing on student achievement or on instruction were tallied as positive, negative, and in-between.


TABLE 3
Coding Themes and Subthemes Used in Template Analysis

Theme                          Subthemes
Scale of Test                  Large-Scale; Classroom; Test for Teachers; Not Specified
Stakes of Test                 High; Medium; Low; Varies; Not Specified
Target of Stakes               Student; Teacher; School
Research Methods               Case Study; Interview/Focus Group; Journal/Work Log; Records/Document Review; Research Review; Experiment/Pre-Post Comparison for which No Effect Size Was Reported; Survey for which No Effect Size Was Reported
Rigor of Qualitative Study     High; Medium; Low
Direction of Testing Effects   Positive; Positive Inferred; Mixed; No Change; Negative

TABLE 4
Research Method Used in Qualitative Studies

Theme/Subtheme                        Number of Studies    Percent of Studies
Case Study                                  120                  43
Interview (Includes Focus Group)             75                  27
Records or Document Review                   33                  12
Survey                                       22                   8
Experiment or Pre-Post Comparison            21                   7
Research Review                               8                   3
Journals or Work Logs                         2                   1
Total                                       281                 101

Note: Percentage sums to greater than 100 due to rounding.


TABLE 5
Level of Education, Scale, Stakes, and Target of Stakes in Qualitative Studies

                            Number of Studies    Percent of Studies
Level of Education
  Two or More Levels              115                  50
  Upper Secondary                  55                  24
  Elementary                       34                  15
  Postsecondary                    18                   8
  Lower Secondary                   7                   3
  Adult Education                   3                   1
  Total                           232                 101
Scale of Test
  Large-scale                     196                  81
  Classroom                        42                  17
  Test for Teachers                 5                   2
  Total                           243                 100
Stakes of Test
  High                            154                  63
  Low                              51                  21
  Medium                           33                  14
  Varies/Not Specified              6                   3
  Total                           244                 101
Target of Stakes
  Student                         150                  62
  School                           80                  33
  Teacher                          10                   4
  Varies                            1                  <1
  Total                           241                 100

Note: Percentages may sum to greater than 100 due to rounding.

TABLE 6
Number of Studies and Effects, By Type

Type of Study     Number of Studies   Number of Effects   Population Coverage (in thousands)   Percent of Studies with U.S. Focus
Quantitative            177                 640                     7,000                                   89
Surveys & Polls         247                 813                       700                                   95
Qualitative             245                 245                   unknown                                   65
Total                   669               1,698


Several formulae are employed to calculate effect sizes, each selected to align with its relevant study design. The most frequently used is some variation of the standardized mean difference (Cohen, 1988):

d = \frac{\bar{X}_T - \bar{X}_C}{s_p} \qquad (1)

where \bar{X}_T is the mean of the treatment group, \bar{X}_C is the mean of the control group, s_p is the pooled standard deviation, and d is the effect size.
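As a worked illustration of Equation 1, the following Python sketch computes d from two groups' raw scores; the score lists are invented, not drawn from any study in the review:

    import math

    def cohens_d(treatment, control):
        """Standardized mean difference (Equation 1): (mean_T - mean_C) / pooled SD."""
        n_t, n_c = len(treatment), len(control)
        mean_t, mean_c = sum(treatment) / n_t, sum(control) / n_c
        # Sample variances (n - 1 denominator), then the pooled standard deviation.
        var_t = sum((x - mean_t) ** 2 for x in treatment) / (n_t - 1)
        var_c = sum((x - mean_c) ** 2 for x in control) / (n_c - 1)
        s_p = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
        return (mean_t - mean_c) / s_p

    # Invented posttest scores for a tested group and an untested control group.
    print(round(cohens_d([78, 85, 90, 82, 88], [70, 75, 80, 72, 77]), 2))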

For those studies that compare proportions rather than means, an effect size comparable to the standardized mean difference is calculated with a log-odds ratio and a logit function transformation (Lipsey & Wilson, 2001, pp. 53–58):

ES = \left( \log_e\left[ \frac{P_T}{1 - P_T} \right] - \log_e\left[ \frac{P_C}{1 - P_C} \right] \right) / 1.83 \qquad (2)

where P_T is the proportion of persons in the treatment group with "successful" outcomes (e.g., passing a test, graduating) and P_C is the proportion of persons in the control group with successful outcomes. The logit function transformation is accomplished by dividing the log-odds ratio by 1.83.
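A sketch of Equation 2, using invented pass rates and the article's 1.83 divisor:

    import math

    def logit_es(p_t, p_c):
        """Equation 2: difference in log odds, divided by 1.83 to approximate d."""
        return (math.log(p_t / (1 - p_t)) - math.log(p_c / (1 - p_c))) / 1.83

    # Invented example: 80% of the treatment group passes vs. 60% of the control group.
    print(round(logit_es(0.80, 0.60), 2))  # about 0.54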

For survey studies, the response pattern for each survey is quantitatively summarized in one of two ways:

• as frequencies for each point along a Likert scale, which can then be summarized by standard measures of central tendency and dispersion, or
• as frequencies for each multiple-choice response option, which can then be converted to percentages.

Just over 100 of the 800+ survey items reported their responses as means and standard deviations on a scale. The effect for a scale item is calculated as the difference of the response mean from the scale midpoint, divided by the standard deviation.

Close to 700 other survey items with multiple-choice response formats reported the frequency and percentage for each choice. The analysis collapses these multiple choices into dichotomous choices: yes and no, for and against, favorable and unfavorable, agree and disagree, support and oppose, etc. Responses that fit neither side (neutral, no opinion, don't know, no response) are not counted. An effect size is then calculated with the aforementioned log-odds ratio and logit function transformation.
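Both survey-item conventions can be sketched as follows. The Likert version restates the rule just given (the mean's distance from the scale midpoint in standard-deviation units); the dichotomous version is one plausible reading of the log-odds procedure, and the handling of neutral responses, dropped before the proportions are formed, is an assumption here, not taken from the article:

    import math

    def likert_item_es(mean, sd, scale_min, scale_max):
        """Scale item: distance of the response mean from the scale midpoint, in SD units."""
        midpoint = (scale_min + scale_max) / 2
        return (mean - midpoint) / sd

    def dichotomous_item_es(n_favor, n_against):
        """Collapsed item: log odds of 'favor' minus log odds of 'against', over 1.83.
        Assumes neutral responses were already dropped from the counts."""
        total = n_favor + n_against
        p_favor, p_against = n_favor / total, n_against / total
        return (math.log(p_favor / (1 - p_favor))
                - math.log(p_against / (1 - p_against))) / 1.83

    # Invented items: a 1-5 Likert item with mean 3.8 and SD 0.9 ...
    print(round(likert_item_es(3.8, 0.9, 1, 5), 2))   # 0.89
    # ... and a North Carolina-style split of 83 favorable vs. 12 unfavorable.
    print(round(dichotomous_item_es(83, 12), 2))      # about 2.1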

Weighting. Effect sizes are summarized in two ways: with an unweighted mean, in which case each effect size counts the same regardless of the size of the population under study, and with a weighted mean, in which case effect size measures calculated on larger groups count for more.

For between 10% and 20% of the quantitative studies (depending on how one aggregates them) that incorporated pre-post comparison designs, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = \frac{2N}{4(1 - r) + d^2} \qquad (3)

where N is the number in the study population, r is the correlation coefficient (between pre- and posttest scores), and d is the effect size.

For the remainder of the studies with standardized mean difference effect sizes, weights (i.e., inverse variances) are calculated thus (Lipsey & Wilson, 2001, p. 72):

w = \frac{2 N_T N_C (N_T + N_C)}{2(N_T + N_C)^2 + N_T N_C d^2} \qquad (4)

where N_T is the number in the treatment group, N_C is the number in the control group, and d is the effect size.
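A sketch of both weight formulas with invented inputs (Equation 3 for a pre-post design, Equation 4 for a two-group design):

    def weight_prepost(n, r, d):
        """Equation 3: inverse-variance weight for a pre-post comparison design."""
        return (2 * n) / (4 * (1 - r) + d ** 2)

    def weight_two_group(n_t, n_c, d):
        """Equation 4: inverse-variance weight for a two-group mean difference."""
        return (2 * n_t * n_c * (n_t + n_c)) / (2 * (n_t + n_c) ** 2 + n_t * n_c * d ** 2)

    # Invented inputs: 60 students with pre-post correlation 0.5, versus a
    # 30-vs-30 two-group design, both with d = 0.7.
    print(round(weight_prepost(60, 0.5, 0.7), 1))    # about 48.2
    print(round(weight_two_group(30, 30, 0.7), 1))   # about 14.1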

For the group of about 100 survey items with Likert-scale responses and standardized mean effect sizes, standard errors and weights (i.e., inverse variances) are calculated thus:

SE = \frac{s}{\sqrt{n}}, \qquad w = \frac{n}{s^2} \qquad (5, 6)

where s is the standard deviation and n is the number of respondents.


For the group of about 700 items with proportional-difference responses and log-odds-ratio effect sizes, standard errors and weights are calculated thus:

SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \qquad w = \frac{1}{SE^2} \qquad (7, 8)

where a is the number of respondents in favor, b is the number of respondents not in group a, c is the number of respondents against, and d is the number of respondents not in group c.

Once the weights and standard errors are calculated, effect sizes are then "weighted" by the following process (Lipsey & Wilson, 2001, pp. 130–132):

1. Each effect size is multiplied by its weight (w_i ES_i).
2. The weighted mean effect size is calculated thus:

\overline{ES} = \frac{\sum_i w_i \, ES_i}{\sum_i w_i} \qquad (9)
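The whole weighting pipeline for survey items can be sketched in a few lines, with invented items: Equations 5-6 supply the weight for a Likert item, Equations 7-8 for a collapsed dichotomous item, and Equation 9 combines them:

    def likert_weight(s, n):
        """Equations 5-6: SE = s / sqrt(n), so the inverse-variance weight is n / s**2."""
        return n / s ** 2

    def log_odds_weight(a, b, c, d):
        """Equations 7-8: SE**2 = 1/a + 1/b + 1/c + 1/d; the weight is 1 / SE**2."""
        return 1 / (1 / a + 1 / b + 1 / c + 1 / d)

    def weighted_mean_es(effects, weights):
        """Equation 9: sum of weighted effects divided by the sum of the weights."""
        return sum(w * es for w, es in zip(weights, effects)) / sum(weights)

    # Three invented items: two Likert items and one dichotomous item.
    effects = [0.9, 1.1, 2.1]
    weights = [likert_weight(0.9, 400),          # Likert item: SD 0.9, 400 respondents
               likert_weight(1.2, 150),          # Likert item: SD 1.2, 150 respondents
               log_odds_weight(83, 17, 12, 88)]  # dichotomous item cell counts
    print(round(weighted_mean_es(effects, weights), 2))  # about 0.95

Because the weights are inverse variances, the item answered by 400 respondents dominates the pooled mean, which is exactly the behavior the weighted summaries are meant to capture.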

Standards for judging effect sizes. Cohen (1988) proposed that a d < 0.20 represented a small effect and a d > 0.80 represented a large effect; a value for d in between, then, represented a medium effect. In his meta-analysis of meta-analyses, however, Hattie (2009) placed the average d for all meta-analyses of student achievement effect studies near 0.40, with a pseudo-Hawthorne Effect base level around 0.10 (i.e., the level at which any achievement-related intervention not producing a clear positive or negative effect will settle).

RESULTS

Quantitative Studies

For quantitative studies, simple and weighted effect sizes are calculated for several different aggregations (e.g., same treatment group, control group, or study author). Mean effect sizes range from a moderate d ≈ 0.55 to a fairly large d ≈ 0.88, depending on the way effects are aggregated or effect sizes are adjusted for study artifacts. Testing with feedback produces the strongest positive effect on achievement. Adding stakes or testing with greater frequency also strongly and positively affects achievement.

Studies are classified with dozens of moderators, including size of study population, scale of test administration, responsible jurisdiction (e.g., nation, state, classroom), level of education, and primary focus of study (e.g., memory retention, testing frequency, mastery testing, accountability program).

For the full panoply of 640 different effects, the "bare bones" mean d ≈ 0.55. After adjustments for small sample sizes (Hedges, 1981) and measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004, chapter 8), the mean d ≈ 0.71, an apparently stronger effect.5

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group or content, with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group, with differing content or outcome measures (N ≈ 430), produces a mean d ≈ 0.75. Grouping studies by study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of its test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated, or received some other type of corrective action based on its test performance, produced a mean d ≈ 0.96. (See Table 7.)

Note that these treatments, such as testing more frequently or with higher stakes, need not be mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions (all with strong effect sizes) but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment Group . . .                                         Mean Effect Size
is made aware of performance, and control group is not              0.98
receives targeted instruction (e.g., remediation)                   0.96
is tested with higher stakes than control group                     0.87
is tested more frequently than control group                        0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                   Number of Studies    Mean Effect Size
Source of Test
  Researcher or Teacher                   87                  0.93
  National                                24                  0.87
  Commercial                              38                  0.82
  State or District                       11                  0.72
  Total                                  160
Sponsor of Test
  International                            5                  1.02
  Local                                   99                  0.93
  National                                45                  0.81
  State                                   11                  0.64
  Total                                  160
Study Design
  Pre-post                                12                  0.97
  Experiment: quasi-experiment           107                  0.94
  Multivariate                            26                  0.80
  Experiment: posttest only                7                  0.60
  Pre-post (with shadow test)              8                  0.58
  Total                                  160
Scale of Analysis
  Aggregated                               9                  1.60
  Small-scale                            118                  0.91
  Large-scale                             33                  0.57
  Total                                  160
Scale of Test Administration
  Classroom                              115                  0.95
  Mid-scale                                6                  0.72
  Large-scale                             39                  0.71
  Total                                  160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies, such as experiments, produce an average effect size of 0.91, whereas large-population studies, such as those using national populations, produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 11 million), and effect size tends to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study population quintile compares to its weighted counterpart of 0.28. For the smallest study population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remainder were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large: in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes, weighted or unweighted, for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly (strongly and positively) to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large −0.5.


TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958–2008 (Note 1)

Cells show Effect Size (Unweighted) / Effect Size (Weighted) / Number of Items.

Item Type                                  Total                Education Providers (Note 2)   Education Consumers (Note 3)
Testing Improves Learning                  1.3 / 0.5 / 242      1.2 / 0.5 / 144                1.3 / 0.6 / 98
Testing Improves Instruction               1.0 / 0.7 / 165      1.0 / 0.6 / 125                1.2 / 1.0 / 40
Testing Aligns Instruction                 0.6 / 0.0 / 28
Favor Grade Promotion Exams                1.3 / 1.1 / 81       1.2 / 1.0 / 26                 1.3 / 1.1 / 53
Favor Graduation Exams                     1.4 / 1.1 / 121      1.2 / 0.8 / 38                 1.4 / 1.2 / 81
Favor More Testing                         0.5 / 0.7 / 72       −0.2 / 0.1 / 19                0.7 / 0.8 / 53
Favor Teacher Competency Exams             2.1 / 0.8 / 61       1.1 / 0.3 / 17                 2.5 / 2.2 / 44
Favor Teacher Recertification Exams        2.1 / 1.5 / 22                                      2.5 / 2.0 / 18
Hold Teachers Accountable for Scores       −0.1 / 0.0 / 31      −0.6 / −0.5 / 18               0.6 / 0.5 / 13
Hold Schools Accountable for Scores        0.6 / 0.5 / 111      0.2 / −0.3 / 38                0.9 / 0.7 / 73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.


TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect    Number of Studies    Percent of Studies    Percent w/o Inferred
Positive                     204                  84                      93
Positive Inferred             24                  10
No Change                      8                   3                       4
Mixed                          5                   2                       2
Negative                       3                   1                       1
Total                        244                 100                     100

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies and very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such a belief should be difficult to maintain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of its test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample, multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So can planners incorporate the "small-scale" benefits of testing (such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions) into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies. Expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.
Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.
King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.
King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm
Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.
Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.
Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 8: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

28 PHELPS

teachers (14) were less often the target of test stakes In some cases stakes wereapplicable to more than one group

Qualitative studies Any study for which an effect size could not be com-puted was designated as qualitative Typically qualitative studies declare evidenceof effects in categorical terms and incorporate more ldquohands onrdquo research methodssuch as direct observations site visits interviews or case studies Here analysis of244 qualitative studies focuses exclusively on the reported direction of the testingeffect on achievement

To determine whether a qualitative study indicated a positive neutral or nega-tive effect of testing on achievement a specific qualitative meta-analysis approachcalled template analysis was employed (eg Crabtree amp Miller 1999 King 19982006 Miles amp Huberman 1994) A hierarchical coding rubric was developed tosummarize and organize the studies according to themes predetermined as im-portant by previous research and informed by ongoing analysis of the data setThe template establishes broad themes (eg stakes) initially then successivelysmaller themes (eg high medium or low) are generated within each broadtheme The template rubric comprised 6 themes and 27 subthemes identified inTable 3

All studies were reviewed and coded by three reviewers a doctoral student inmeta-analysis an editor and me Any disparities in coding were discussed amongreviewers in an attempt to reach a consensus In less than 10 cases in which a clearconsensus was not reached the author made the coding decisions

Qualitative studies used various research methods and in various combinationsIndeed 37 of the 245 studies incorporated more than one method (Table 4)

Table 5 categorizes the qualitative studies according to the level of educationand the scale stakes and target of the stakes of the tests involved Most (81)studies included in this analysis were large-scale while 17 of the studies wereat the classroom level Two percent of the studies focused on testing for teachers

Summary Table 6 lists the overall numbers of studies and effects bytypemdashquantitative survey and qualitativemdashand geographic coveragemdashUS ornon-US Several studies fit more than one category (eg a program evaluationwith both survey and observational case study components)

Calculating Effects

For quantitative and survey studies effect sizes were calculated and the datasummarized with weighted and unweighted means Qualitative studies that reacheda conclusion about the effect of testing on student achievement or on instructionwere tallied as positive negative and in-between

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 29

TABLE 3Coding Themes and Subthemes Used in Template Analysis

Themes Subthemes

Scale of Test bull Large-Scalebull Classroombull Test for Teachersbull Not Specified

Stakes of Test bull Highbull Mediumbull Lowbull Variesbull Not Specified

Target of Stakes bull Studentbull Teacherbull School

Research Methods bull Case Studybull InterviewFocus Groupbull JournalWork Logbull RecordsDocument Reviewbull Research Reviewbull ExperimentPre-Post Comparison for which No

Effect Size Was Reportedbull Survey for which No Effect Size Was Reported

Rigor of Qualitative Study bull Highbull Mediumbull Low

Direction of Testing Effects bull Positivebull Positive Inferredbull Mixedbull No Changebull Negative

TABLE 4Research Method Used in Qualitative Studies

ThemesSubthemes Number of Studies Percent of Studies

Case Study 120 43Interview (Includes Focus Group) 75 27Records or Document Review 33 12Survey 22 8Experiment or Pre-Post Comparison 21 7Research Review 8 3Journals or Work Logs 2 1Total 281 101

Note Percentage sums to greater than 100 due to rounding

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

30 PHELPS

TABLE 5Level of Education Scale Stakes and Target of Stakes in Qualitative Studies

Number of Studies Percent of Studies

Level of EducationTwo or More Levels 115 50Upper Secondary 55 24Elementary 34 15Postsecondary 18 8Lower Secondary 7 3Adult Education 3 1Total 232 101

Scale of TestLarge-scale 196 81Classroom 42 17Test for Teachers 5 2Total 243 100

Stakes of TestHigh 154 63Low 51 21Medium 33 14VariesNot Specified 6 3Total 244 101

Target of StakesStudent 150 62School 80 33Teacher 10 4Varies 1 lt1Total 241 100

Note Percentages may sum to greater than 100 due to rounding

TABLE 6Number of Studies and Effects By Type

Type of Number of Number of Population Coverage Percent of StudiesStudy Studies Effects (in thousands) with US Focus

Quantitative 177 640 7000 89Surveys amp Polls 247 813 700 95Qualitative 245 245 unknown 65Total 669 1698

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reportedthe frequency and percentage for each choice The analysis collapses these multiplechoices into dichotomous choicesmdashyes and no for and against favorable andunfavorable agree and disagree support and oppose etc Responses that fit neithersidemdashneutral no opinion donrsquot know no responsemdashare not counted An effect

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

TABLE 9
Effect Sizes of Survey Responses, By Item Group, For Total Population, Education Providers, and Education Consumers, 1958–2008 (1)

                                           Total                     Education Providers (2)       Education Consumers (3)
Item Type                           ES(Unwt)  ES(Wt)  Items      ES(Unwt)  ES(Wt)  Items       ES(Unwt)  ES(Wt)  Items
Testing Improves Learning              1.3      0.5     242         1.2      0.5     144           1.3      0.6      98
Testing Improves Instruction           1.0      0.7     165         1.0      0.6     125           1.2      1.0      40
Testing Aligns Instruction             0.6      0.0      28
Favor Grade Promotion Exams            1.3      1.1      81         1.2      1.0      26           1.3      1.1      53
Favor Graduation Exams                 1.4      1.1     121         1.2      0.8      38           1.4      1.2      81
Favor More Testing                     0.5      0.7      72        -0.2      0.1      19           0.7      0.8      53
Favor Teacher Competency Exams         2.1      0.8      61         1.1      0.3      17           2.5      2.2      44
Favor Teacher Recertification Exams    2.1      1.5      22                                        2.5      2.0      18
Hold Teachers Accountable for Scores  -0.1      0.0      31        -0.6     -0.5      18           0.6      0.5      13
Hold Schools Accountable for Scores    0.6      0.5     111         0.2     -0.3      38           0.9      0.7      73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.


Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large -0.5.

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive, inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect     Number of Studies   Percent of Studies   Percent w/o Inferred
Positive                      204                  84                    93
Positive, Inferred             24                  10
No Change                       8                   3                     4
Mixed                           5                   2                     2
Negative                        3                   1                     1
Total                         244                 100                   100
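The last column of Table 10 can be reproduced from the counts, assuming it renormalizes the percentages after dropping the inferred category:

```python
counts = {"Positive": 204, "Positive, Inferred": 24,
          "No Change": 8, "Mixed": 5, "Negative": 3}
total = sum(counts.values())                      # 244
total_wo = total - counts["Positive, Inferred"]   # 220

print(round(100 * counts["Positive"] / total))    # 84 (Percent of Studies)
print(round(100 * counts["Positive"] / total_wo)) # 93 (Percent w/o Inferred)
```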

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies, and very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited,


or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such a belief should be difficult to sustain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal strategy would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations


for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies usually are not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be large scale. So, can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies; expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from the other quantitative studies. A skeptic might interpret this difference in mean effect sizes


as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.
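Because those responses are pooled from surveys of very different sizes, the summary reports both unweighted means, in which every item counts equally, and inverse-variance weighted means, in which larger surveys count for more. A minimal sketch of the pooling step, with hypothetical item-level values:

```python
def weighted_mean_effect_size(effect_sizes, weights):
    # Inverse-variance weighted mean: sum(w_i * ES_i) / sum(w_i)
    # (Lipsey & Wilson, 2001).
    return sum(w * es for es, w in zip(effect_sizes, weights)) / sum(weights)

# Hypothetical: three survey items, the largest survey carrying the
# smallest effect size, which pulls the weighted mean below the
# unweighted one -- the pattern visible in Table 9.
es = [2.1, 1.3, 0.9]
w = [150.0, 800.0, 5000.0]
print(sum(es) / len(es))                 # unweighted: ~1.43
print(weighted_mean_effect_size(es, w))  # weighted:   ~0.98
```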

NOTES

1. Testing critics often characterize alignment as deleterious; "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.

2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.

3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective-schools or curricular-alignment studies that did not isolate testing's effect, teacher-effectiveness studies, and conceptual or mathematical models lacking empirical evidence.

4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.

5. One benefit of adjusting effect sizes for measurement uncertainty is to, at least in relative terms, reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.
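Note 5 refers to the artifact adjustments applied to the quantitative effect sizes: correction for small-sample bias (Hedges, 1981) and for measurement uncertainty in the dependent variable (Hunter & Schmidt, 2004). A minimal sketch of those two corrections, with hypothetical inputs:

```python
import math

def hedges_correction(d, n_treatment, n_control):
    # Hedges (1981) small-sample bias correction for d.
    df = n_treatment + n_control - 2
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))

def disattenuate(d, outcome_reliability):
    # Hunter & Schmidt (2004): correct d for measurement error
    # in the dependent variable.
    return d / math.sqrt(outcome_reliability)

d = hedges_correction(0.55, 20, 20)     # ~0.54
print(round(disattenuate(d, 0.80), 2))  # ~0.60
```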

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manual: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.


Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.


THE EFFECT OF TESTING ON ACHIEVEMENT 29

TABLE 3Coding Themes and Subthemes Used in Template Analysis

Themes Subthemes

Scale of Test bull Large-Scalebull Classroombull Test for Teachersbull Not Specified

Stakes of Test bull Highbull Mediumbull Lowbull Variesbull Not Specified

Target of Stakes bull Studentbull Teacherbull School

Research Methods bull Case Studybull InterviewFocus Groupbull JournalWork Logbull RecordsDocument Reviewbull Research Reviewbull ExperimentPre-Post Comparison for which No

Effect Size Was Reportedbull Survey for which No Effect Size Was Reported

Rigor of Qualitative Study bull Highbull Mediumbull Low

Direction of Testing Effects bull Positivebull Positive Inferredbull Mixedbull No Changebull Negative

TABLE 4Research Method Used in Qualitative Studies

ThemesSubthemes Number of Studies Percent of Studies

Case Study 120 43Interview (Includes Focus Group) 75 27Records or Document Review 33 12Survey 22 8Experiment or Pre-Post Comparison 21 7Research Review 8 3Journals or Work Logs 2 1Total 281 101

Note Percentage sums to greater than 100 due to rounding

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

30 PHELPS

TABLE 5Level of Education Scale Stakes and Target of Stakes in Qualitative Studies

Number of Studies Percent of Studies

Level of EducationTwo or More Levels 115 50Upper Secondary 55 24Elementary 34 15Postsecondary 18 8Lower Secondary 7 3Adult Education 3 1Total 232 101

Scale of TestLarge-scale 196 81Classroom 42 17Test for Teachers 5 2Total 243 100

Stakes of TestHigh 154 63Low 51 21Medium 33 14VariesNot Specified 6 3Total 244 101

Target of StakesStudent 150 62School 80 33Teacher 10 4Varies 1 lt1Total 241 100

Note Percentages may sum to greater than 100 due to rounding

TABLE 6Number of Studies and Effects By Type

Type of Number of Number of Population Coverage Percent of StudiesStudy Studies Effects (in thousands) with US Focus

Quantitative 177 640 7000 89Surveys amp Polls 247 813 700 95Qualitative 245 245 unknown 65Total 669 1698

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reportedthe frequency and percentage for each choice The analysis collapses these multiplechoices into dichotomous choicesmdashyes and no for and against favorable andunfavorable agree and disagree support and oppose etc Responses that fit neithersidemdashneutral no opinion donrsquot know no responsemdashare not counted An effect

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 10: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

30 PHELPS

TABLE 5Level of Education Scale Stakes and Target of Stakes in Qualitative Studies

Number of Studies Percent of Studies

Level of EducationTwo or More Levels 115 50Upper Secondary 55 24Elementary 34 15Postsecondary 18 8Lower Secondary 7 3Adult Education 3 1Total 232 101

Scale of TestLarge-scale 196 81Classroom 42 17Test for Teachers 5 2Total 243 100

Stakes of TestHigh 154 63Low 51 21Medium 33 14VariesNot Specified 6 3Total 244 101

Target of StakesStudent 150 62School 80 33Teacher 10 4Varies 1 lt1Total 241 100

Note Percentages may sum to greater than 100 due to rounding

TABLE 6Number of Studies and Effects By Type

Type of Number of Number of Population Coverage Percent of StudiesStudy Studies Effects (in thousands) with US Focus

Quantitative 177 640 7000 89Surveys amp Polls 247 813 700 95Qualitative 245 245 unknown 65Total 669 1698

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reportedthe frequency and percentage for each choice The analysis collapses these multiplechoices into dichotomous choicesmdashyes and no for and against favorable andunfavorable agree and disagree support and oppose etc Responses that fit neithersidemdashneutral no opinion donrsquot know no responsemdashare not counted An effect

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 1.1 million), and effect size tends to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study-population quintile compares to its weighted counterpart of 0.28. For the smallest study-population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.
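The sketch below illustrates the mechanics, using the inverse-variance weight for a standardized mean difference that this review takes from Lipsey and Wilson (2001, p. 72). The two studies are hypothetical stand-ins: one small study with a large effect and one large study with a small effect. Because the weight grows with sample size, the weighted mean is pulled toward the large study's result.

```python
def lw_weight(n_t: int, n_c: int, d: float) -> float:
    """Inverse-variance weight for a standardized mean difference
    (Lipsey & Wilson, 2001, p. 72)."""
    n = n_t + n_c
    return (2.0 * n_t * n_c * n) / (2.0 * n ** 2 + n_t * n_c * d ** 2)

# Hypothetical pair: (n_treatment, n_control, d)
studies = [(25, 25, 1.04), (5000, 5000, 0.28)]

weights = [lw_weight(nt, nc, d) for nt, nc, d in studies]
unweighted = sum(d for _, _, d in studies) / len(studies)
weighted = sum(w * d for w, (_, _, d) in zip(weights, studies)) / sum(weights)

print(f"unweighted mean d = {unweighted:.2f}")  # 0.66
print(f"weighted mean d   = {weighted:.2f}")    # 0.28, dominated by the large study
```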

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remaining were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.
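For survey items like these two, an effect size can be computed from the response percentages alone. The sketch below assumes one plausible reading of the article's procedure: neutral responses are dropped, the favorable/unfavorable split is renormalized, and the log-odds of a favorable response is divided by 1.83 (the article's logit transformation) to place it on a d-like scale. The exact dichotomization convention is an assumption here.

```python
import math

def survey_effect_size(pct_favor: float, pct_oppose: float) -> float:
    """Logit-based effect size for a dichotomized survey item.
    Neutral responses are dropped; a 50/50 split yields 0."""
    p = pct_favor / (pct_favor + pct_oppose)  # renormalize without neutrals
    return math.log(p / (1.0 - p)) / 1.83     # log-odds on a d-like scale

# North Carolina, 2002: 83% favorable, 12% unfavorable
print(round(survey_effect_size(83, 12), 2))  # ≈ 1.06
# Alberta, 1990: 41% favorable, 32% unfavorable
print(round(survey_effect_size(41, 32), 2))  # ≈ 0.14
```

On this reading, the lopsided North Carolina item yields a very large effect size, while the closely divided Alberta item yields a small one, consistent with the pattern of results reported below.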

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large, in many cases more than 1.0 and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes, weighted or unweighted, for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly, strongly and positively, to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive, but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large -0.5.

TABLE 9
Effect Sizes of Survey Responses, by Item Group, for Total Population, Education Providers, and Education Consumers, 1958–2008 (Note 1)

                                          Total                         Education Providers (Note 2)    Education Consumers (Note 3)
                                     ES        ES      No. of        ES        ES      No. of        ES        ES      No. of
Item Group                           (Unwt.)   (Wt.)   Items         (Unwt.)   (Wt.)   Items         (Unwt.)   (Wt.)   Items
Testing Improves Learning             1.3       0.5     242           1.2       0.5     144           1.3       0.6      98
Testing Improves Instruction          1.0       0.7     165           1.0       0.6     125           1.2       1.0      40
Testing Aligns Instruction            0.6       0.0      28
Favor Grade Promotion Exams           1.3       1.1      81           1.2       1.0      26           1.3       1.1      53
Favor Graduation Exams                1.4       1.1     121           1.2       0.8      38           1.4       1.2      81
Favor More Testing                    0.5       0.7      72          -0.2       0.1      19           0.7       0.8      53
Favor Teacher Competency Exams        2.1       0.8      61           1.1       0.3      17           2.5       2.2      44
Favor Teacher Recertification Exams   2.1       1.5      22                                           2.5       2.0      18
Hold Teachers Accountable for Scores -0.1       0.0      31          -0.6      -0.5      18           0.6       0.5      13
Hold Schools Accountable for Scores   0.6       0.5     111           0.2      -0.3      38           0.9       0.7      73

Notes: 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.


Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect    Number of Studies    Percent of Studies    Percent w/o Inferred
Positive                     204                   84                     93
Positive, Inferred            24                   10
No Change                      8                    3                      4
Mixed                          5                    2                      2
Negative                       3                    1                      1
Total                        244                  100                    100
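Note that the two percentage columns of Table 10 use different bases: all 244 studies for "Percent of Studies," and the 220 studies with declared (non-inferred) effects for "Percent w/o Inferred." A quick arithmetic check, using the table's own counts:

```python
counts = {"Positive": 204, "Positive, Inferred": 24,
          "No Change": 8, "Mixed": 5, "Negative": 3}

total = sum(counts.values())                     # 244 studies in all
total_wo = total - counts["Positive, Inferred"]  # 220 without inferred cases

for direction, n in counts.items():
    pct_all = 100 * n / total
    if direction == "Positive, Inferred":
        print(f"{direction:18s} {n:4d} {pct_all:4.0f}%     -")  # blank in Table 10
    else:
        print(f"{direction:18s} {n:4d} {pct_all:4.0f}% {100 * n / total_wo:4.0f}%")
# Positive: 84% of all 244 studies; 93% of the 220 non-inferred studies
```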

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies, and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such a belief should be difficult to maintain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So, can planners incorporate the "small-scale" benefits of testing (such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions) into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Indeed, even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Public opinion polls sometimes represent the best available method for learning the public's wishes in democracies; expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective-schools or curricular-alignment studies that did not isolate testing's effect, teacher-effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manual: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.


Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.


Page 11: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 31

Several formulae are employed to calculate effect sizes each selected to alignto its relevant study design The most frequently used is some variation of thestandardized mean difference (Cohen 1988)

d = XT minus XC

sp(1)

where

XT = number in treatment groupXC = number in control groupSp = pooled standard deviation andd = effect size

For those studies that compare proportions rather than means an effect sizecomparable to the standard mean difference effect size is calculated with a log-oddsratio and a logit function transformation (Lipsey amp Wilson 2001 pp 53ndash58)

ES =(

loge

[PT

1 minus PT

]minus loge

[PC

1 minus PC

])183 (2)

PT is the proportion of persons in the treatment group with ldquosuccessfulrdquo out-comes (eg passing a test graduating) and PC is the proportion of persons incontrol group with successful outcomes The logit function transformation is ac-complished by dividing the log-odds ratio by 183

For survey studies the response pattern for each survey is quantitatively sum-marized in one of two ways

bull as frequencies for each point along a Likert scale which can then be summarizedby standard measures of central tendency and dispersion or

bull as frequencies for each multiple choice response option which can then beconverted to percentages

Just over 100 of the almost 800+ survey items reported their responses inmeans and standard deviations on a scale Effects for scale items are calculatedas the difference of the response mean from the scale midpoint divided by thestandard deviation

Close to 700 other survey items with multiple-choice response formats reportedthe frequency and percentage for each choice The analysis collapses these multiplechoices into dichotomous choicesmdashyes and no for and against favorable andunfavorable agree and disagree support and oppose etc Responses that fit neithersidemdashneutral no opinion donrsquot know no responsemdashare not counted An effect

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 12: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

32 PHELPS

size is then calculated with the aforementioned log-odds ratio and a logit functiontransformation

Weighting Effect sizes are summarized in two ways with an unweightedmean in which case each effect size counts the same regardless the size of thepopulation under study and with a weighted mean in which case effect sizemeasures calculated on larger groups count for more

For between 10 and 20 of the quantitative studies (depending on how oneaggregates them) that incorporated pre-post comparison designs weights (ieinverse variances) are calculated thus (Lipsey amp Wilson 2001 p 72)

w = 2N

4(1 minus r) + d2(3)

where

N = number in study populationr = correlation coefficient andd = effect size

For the remainder of the studies with standardized mean difference effectsizes weights (ie inverse variances) are calculated thus (Lipsey amp Wilson 2001p 72)

w = 2 times NT times NC(NT + NC)

2(NT + NC)2 + NT times NC times d2(4)

where

NT = number in treatment groupNC = number in control group andd = effect size

For the group of about 100 survey items with Likert-scale responses withstandardized mean effect sizes standard errors and weights (ie inverse variances)are calculated thus

SE = sradicn

w = n

s2(5 6)

wheres = standard deviation andn = number of respondents

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies, adjusted mean effect sizes vary. Grouping studies by the same treatment or control group or content, with differing outcome measures (N ≈ 600), produces a mean d ≈ 0.73. Grouping studies by the same treatment or control group, with differing content or outcome measures (N ≈ 430), produces a mean d ≈ 0.75. Grouping studies by the study author, such that a mean is calculated across all of a single author's studies (N = 160), produces a mean d ≈ 0.88.

To illustrate the various moderators' influence on the effect sizes, the same-study-author aggregation is employed. Studies in which the treatment group tested more frequently than the control group produced a mean d ≈ 0.85. Studies in which the treatment group was tested with higher stakes than the control group produced a mean d ≈ 0.87. Studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a mean d ≈ 0.98. Studies in which the treatment group was remediated, or received some other type of corrective action based on their test performance, produced a mean d ≈ 0.96. (See Table 7.)

Note that these treatments, such as testing more frequently or with higher stakes, are not mutually exclusive; they can be used complementarily, and often are. This suggests that the optimal intervention would not choose among these interventions—all with strong effect sizes—but rather use as many of them as possible at the same time.

TABLE 7
Mean Effect Sizes for Various Treatments

Treatment Group                                            Mean Effect Size
Is made aware of performance and control group is not           0.98
Receives targeted instruction (e.g., remediation)               0.96
Is tested with higher stakes than control group                 0.87
Is tested more frequently than control group                    0.85


TABLE 8
Source, Sponsor, Study Design, and Scale in Quantitative Studies

                                 Number of Studies    Mean Effect Size
Source of Test
  Researcher or Teacher                 87                 0.93
  National                              24                 0.87
  Commercial                            38                 0.82
  State or District                     11                 0.72
  Total                                160

Sponsor of Test
  International                          5                 1.02
  Local                                 99                 0.93
  National                              45                 0.81
  State                                 11                 0.64
  Total                                160

Study Design
  Pre-post                              12                 0.97
  Experiment: Quasi-experiment         107                 0.94
  Multivariate                          26                 0.80
  Experiment: Posttest only              7                 0.60
  Pre-post (with shadow test)            8                 0.58
  Total                                160

Scale of Analysis
  Aggregated                             9                 1.60
  Small-scale                          118                 0.91
  Large-scale                           33                 0.57
  Total                                160

Scale of Test Administration
  Classroom                            115                 0.95
  Mid-scale                              6                 0.72
  Large-scale                           39                 0.71
  Total                                160

Mean effect sizes can be compared between other pairs of moderators, too (see Table 8). For example, small-population studies, such as experiments, produce an average effect size of 0.91, whereas large-population studies, such as those using national populations, produce an average effect size of 0.57. Studies of small-scale test administrations (0.92) produce a larger mean effect size than do studies of large-scale test administrations (0.78). Studies of local test administrations produce a larger mean effect size (0.93) than do studies of state test administrations (0.64).

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76) (which are often large-scale). Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterward (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination).


Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 11 million), and effect size tends to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study-population quintile compares to its weighted counterpart of 0.28. For the smallest study-population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to US teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remaining were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.
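To see how such response percentages can feed the log-odds-ratio machinery of equations (7) and (8), consider a worked illustration using the North Carolina item above, with counts per 100 respondents (a = 83 in favor, b = 17, c = 12 against, d = 88). The final conversion of a log odds ratio to a d-scale effect size by the factor \sqrt{3}/\pi is a common convention (e.g., in Lipsey & Wilson, 2001), not necessarily the article's exact procedure:

\mathrm{LOR} = \ln\!\left(\frac{a/b}{c/d}\right) = \ln\!\left(\frac{83/17}{12/88}\right) \approx \ln(35.8) \approx 3.58, \qquad d \approx \frac{\sqrt{3}}{\pi}\,\mathrm{LOR} \approx 0.55 \times 3.58 \approx 1.97

An effect on that order is consistent with the survey effect sizes reported below, many above 1.0 and some above 2.0.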

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency


exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large, in many cases more than 1.0 and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another—holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive, whether the survey item focused on testing's effect on improving learning or on improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes—weighted or unweighted—for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups—"favor more testing" and "hold schools accountable for (student) scores"—garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly—strongly and positively—to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too.


TABLE 9
Effect Sizes of Survey Responses, by Item Group, for Total Population, Education Providers, and Education Consumers, 1958–2008 (1)

                                           Total                    Education Providers (2)      Education Consumers (3)
Item Type                            ES       ES       No. of     ES       ES       No. of     ES       ES       No. of
                                   (Unwt)   (Wt)      Items     (Unwt)   (Wt)      Items     (Unwt)   (Wt)      Items
Testing Improves Learning            1.3      0.5       242       1.2      0.5       144       1.3      0.6        98
Testing Improves Instruction         1.0      0.7       165       1.0      0.6       125       1.2      1.0        40
Testing Aligns Instruction           0.6      0.0        28        —        —         —         —        —          —
Favor Grade Promotion Exams          1.3      1.1        81       1.2      1.0        26       1.3      1.1        53
Favor Graduation Exams               1.4      1.1       121       1.2      0.8        38       1.4      1.2        81
Favor More Testing                   0.5      0.7        72      -0.2      0.1        19       0.7      0.8        53
Favor Teacher Competency Exams       2.1      0.8        61       1.1      0.3        17       2.5      2.2        44
Favor Teacher Recertification Exams  2.1      1.5        22        —        —         —        2.5      2.0        18
Hold Teachers Accountable for Scores -0.1     0.0        31      -0.6     -0.5        18       0.6      0.5        13
Hold Schools Accountable for Scores  0.6      0.5       111       0.2     -0.3        38       0.9      0.7        73

Notes: 1. Blank cells (—) represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.



Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large -0.5.

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect     Number of Studies   Percent of Studies   Percent w/o Inferred
Positive                      204                  84                   93
Positive, Inferred             24                  10                   —
No Change                       8                   3                    4
Mixed                           5                   2                    2
Negative                        3                   1                    1
Total                         244                 100                  100

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large—245 studies conducted over the course of a century in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited,


or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such should be difficult to believe after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations


for this. First, larger studies often contain more "noise"—information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis, but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So can planners incorporate the "small-scale" benefits of testing—such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions—into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Indeed, even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies. Expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes


as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive—almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious—"narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.

2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.

3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.

4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.

5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 13: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 33

For the group of about 700 items with proportional-difference responses withlog-odds-ratio effect sizes standard errors and weights are calculated thusly

w = 1

SE 2SE =

radic1

a+ 1

b+ 1

c+ 1

d(7 8)

where

a = number of respondents in favorb = number of respondents not in group ac = number of respondents against andd = number of respondents not in group c

Once the weights and standard errors are calculated effect sizes are thenldquoweightedrdquo by the following process (Lipsey amp Wilson 2001 pp 130ndash132)

1 Each effect size is multiplied by its weight wes2 The weighted mean effect size is calculated thusly

ES =sum

wesisumwi

(9)

Standards for judging effect sizes Cohen (1988) proposed that a d lt

020 represented a small effect and a d gt 080 represented a large effect Avalue for d in between then represented a medium effect In his meta-analysis ofmeta-analyses however Hattie (2009) placed the average d for all meta-analysesof student achievement effect studies near 040 with a pseudo-Hawthorne Effectbase level around 010 (ie the level at which any achievement-related interventionnot producing a clear positive or negative effect will settle)

RESULTS

Quantitative Studies

For quantitative studies simple and weighted effect sizes are calculated for sev-eral different aggregations (eg same treatment group control group or studyauthor) Mean effect sizes range from a moderate d asymp 055 to a fairly large d asymp088 depending on the way effects are aggregated or effect sizes are adjusted forstudy artifacts Testing with feedback produces the strongest positive effect on

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 14: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

34 PHELPS

achievement Adding stakes or testing with greater frequency also strongly andpositively affects achievement

Studies are classified with dozens of moderators including size of study pop-ulation scale of test administration responsible jurisdiction (eg nation stateclassroom) level of education and primary focus of study (eg memory reten-tion testing frequency mastery testing accountability program)

For the full panoply of 640 different effects the ldquobare bonesrdquo mean d asymp055 After adjustments for small sample sizes (Hedges 1981) and measurementuncertainty in the dependent variable (Hunter amp Schmidt 2004 chapter 8) themean d asymp 071 an apparently stronger effect5

For different aggregations of studies adjusted mean effect sizes vary Groupingstudies by the same treatment or control group or content with differing outcomemeasures (N asymp 600) produces a mean d asymp 073 Grouping studies by the sametreatment or control group with differing content or outcome measures (N asymp 430)produces a mean d asymp 075 Grouping studies by the study author such that a meanis calculated across all of a single authorrsquos studies (N = 160) produces a meand asymp 088

To illustrate the various moderatorsrsquo influence on the effect sizes the same-study-author aggregation is employed Studies in which the treatment group testedmore frequently than the control group produced a mean d asymp 085 Studies in whichthe treatment group was tested with higher stakes than the control group produceda mean d asymp 087 Studies in which the treatment group was made aware of theirtest performance or course progress (and the control group was not) produced amean d asymp 098 Studies in which the treatment group was remediated or receivedsome other type of corrective action based on their test performance produced amean d asymp 096 (See Table 7)

Note that these treatments such as testing more frequently or with higherstakes need not be mutually exclusive can be used complementarily and oftenare This suggests that the optimal intervention would not choose among theseinterventionsmdashall with strong effect sizesmdashbut rather use as many of them aspossible at the same time

TABLE 7Mean Effect Sizes for Various Treatments

Treatment Group Mean Effect Size

is made aware of performance and control group is not 098

receives targeted instruction (eg remediation) 096

is tested with higher stakes than control group 087

is tested more frequently than control group 085

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a larger mean effect size (1.02) than do studies conducted at the elementary-secondary level (0.76), which are often large-scale. Studies in which the intervention occurs before the outcome measurement produce a larger mean effect size (0.93) than do "backwash" studies, for which the intervention occurs afterwards (0.40) (e.g., when an increase in lower-secondary test scores is observed after the introduction of a high-stakes upper-secondary examination). Studies in which the subject matter content of the intervention and outcome are aligned produce a larger mean effect size (0.91) than do studies in which they are not aligned (0.78).

The quantitative studies vary dramatically in population size (from just 16 to 1.1 million), and effect size tends to correlate negatively with population size. The mean for the smallest quintile among the full 640 effects, for example, is 1.04, whereas that for the largest quintile is 0.30. Starting with the adjusted overall mean effect size of 0.71 (for all 640 effects), the mean grows steadily larger as the studies with the largest populations are removed. For all studies with populations less than 100,000, the mean effect size is 0.73; under 10,000, 0.80; under 1,000, 0.83; under 500, 0.88; under 100, 0.92; and under 50, 1.06.

Generally, weighting reduces mean effect sizes, further suggesting that the larger the study population, the weaker the results. For example, the unweighted effect size of 0.30 for the largest study population quintile compares to its weighted counterpart of 0.28. For the smallest study population quintile, the effect size is 1.04 unweighted but only 0.79 weighted.
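
The weighted means here presumably take the usual form (in LaTeX notation), with weights proportional to each study's population N_i, which the comparison above implies:

    \bar{d}_w = \frac{\sum_i N_i \, d_i}{\sum_i N_i}

Because the largest studies tend to carry the smallest effects, population weights pull the mean below its unweighted counterpart.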

Survey Studies

Extracting meaning from the 800+ separate survey items required condensation. First, survey items were separated into two overarching groups, in which the perception of a testing effect is either explicit or inferred. Within the explicit, or directly observed, category, survey items were further classified into two item types: those related to improving learning or to improving instruction.

Among the 800+ individual survey items analyzed, few were worded exactly like others. Often the meaning of two different questions or prompts is very similar, however, even though the wording varies. Consider a question posed to U.S. teachers in the state of North Carolina in 2002: "Have students learned more because of the [testing program]?" Eighty-three percent responded favorably and 12% unfavorably, while 5% were neutral (e.g., no opinion, don't know). This survey item is classified in the "improve learning" item group.
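
The conversion of such response splits into effect sizes is not spelled out at this point in the article; one standard way to express a favorable-versus-unfavorable split as a standardized effect, shown here purely for illustration (Python; not necessarily the author's method), is Cohen's h for two proportions:

    import math

    def cohens_h(p_favorable: float, p_unfavorable: float) -> float:
        """Cohen's h: standardized difference between two proportions."""
        return 2 * math.asin(math.sqrt(p_favorable)) - 2 * math.asin(math.sqrt(p_unfavorable))

    # North Carolina item: 83% favorable vs. 12% unfavorable
    print(round(cohens_h(0.83, 0.12), 2))  # ≈ 1.58, large by conventional benchmarks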

Now consider a question posed to teachers in the Canadian province of Alberta in 1990: "The diploma examinations have positively affected the way in which I teach." Forty-one percent responded favorably, 32% unfavorably, and the remainder were neutral (or unresponsive). This survey item is classified in the "improve instruction" item group.

Second, the many survey items for which the respondents' perception of a testing effect was inferred were separated into several item groups, including "Favor grade promotion exams," "Favor graduation exams," "Favor teacher competency exams," "Favor teacher recertification exams," "Favor more testing," "Hold schools accountable for students' scores," and "Hold teachers accountable for students' scores."

A perception of a testing effect is inferred when an expression of support for a testing program derives from a belief that the program is beneficial. Conversely, a belief that a testing program is harmful is inferred from an expression of opposition.

For the most basic effects of testing (e.g., improves learning or instruction) and the most basic uses of tests (e.g., graduation, certification), effect sizes are large: in many cases more than 1.0, and in some cases more than 2.0. Effect sizes are weaker for situations in which one group is held accountable for the performance of another, that is, holding either teachers or schools accountable for student scores.

The results vary at most negligibly with study level of rigor or test stakes. And the results are overwhelmingly positive whether the survey item focused on testing's effect on improving learning or improving instruction. Some of the study results can be found in the following tables. Table 9 lists the unweighted and weighted effect sizes by item group for the total sample respondent population. With the exception of the item group "hold teachers accountable for student scores," effect sizes are positive for all item groups. The most popular item groups appear to be "Favor teacher recertification exams," "Favor teacher competency exams," "Favor (student) graduation exams," and "Favor (student) grade promotion exams."

Effect sizes, weighted or unweighted, for these four item groups are all positive and large (i.e., > 0.8) (Cohen, 1988). The effect sizes for other item groups, such as "improve learning" and "improve instruction," also are positive and large, whether weighted or unweighted. The remaining two item groups, "favor more testing" and "hold schools accountable for (student) scores," garner moderately large, positive effect sizes, whether weighted or unweighted.

Table 9 also breaks out the effect size results by two different groups of respondents: education providers and consumers. Education providers comprise primarily administrators and teachers, whereas education consumers comprise mostly students and noneducator adults.

As groups, providers and consumers tend to respond similarly, strongly and positively, to items about test effects (tests "improve learning" and "improve instruction"). Likewise, both groups strongly favor high-stakes testing for students ("favor grade promotion exams" and "favor graduation exams").

Provider and consumer groups differ in their responses in the other item groups, however. For example, consumers are overwhelmingly in favor of teacher testing, both competency (i.e., certification) and recertification exams, with weighted effect sizes of 2.2 and 2.0, respectively. Provider responses are positive but not nearly as strong, with weighted effect sizes of 0.3 and 0.4, respectively.

Clear differences of opinion between provider and consumer groups can be seen among the remaining item groups, too. Consumers tend to favor more testing and holding teachers and schools accountable for student test scores. Providers either are less supportive in these item groups or are opposed. For example, providers' weighted effect size for "hold teachers accountable for students' scores" is a moderately large −0.5.

TABLE 9
Effect Sizes of Survey Responses, by Item Group, for Total Population, Education Providers, and Education Consumers, 1958–2008 (1)

                                              Total                  Education Providers (2)     Education Consumers (3)
Item Type                               Unwtd   Wtd   Items         Unwtd   Wtd   Items         Unwtd   Wtd   Items
Testing Improves Learning                1.3    0.5    242           1.2    0.5    144           1.3    0.6     98
Testing Improves Instruction             1.0    0.7    165           1.0    0.6    125           1.2    1.0     40
Testing Aligns Instruction               0.6    0.0     28
Favor Grade Promotion Exams              1.3    1.1     81           1.2    1.0     26           1.3    1.1     53
Favor Graduation Exams                   1.4    1.1    121           1.2    0.8     38           1.4    1.2     81
Favor More Testing                       0.5    0.7     72          −0.2    0.1     19           0.7    0.8     53
Favor Teacher Competency Exams           2.1    0.8     61           1.1    0.3     17           2.5    2.2     44
Favor Teacher Recertification Exams      2.1    1.5     22                                       2.5    2.0     18
Hold Teachers Accountable for Scores    −0.1    0.0     31          −0.6   −0.5     18           0.6    0.5     13
Hold Schools Accountable for Scores      0.6    0.5    111           0.2   −0.3     38           0.9    0.7     73

Notes: Effect sizes are reported unweighted (Unwtd) and weighted (Wtd); "Items" is the number of survey items. 1. Blank cells represent numbers of items too small to report validly. 2. Education providers are education administrators, principals, teachers, education agency officials, and university faculty in education programs. 3. Education consumers are students, parents, the general public, employers, public officials not working in education agencies, and university faculty outside of education.

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effect on achievement or instruction. Ninety-three percent of the qualitative studies analyzed reported positive effects of testing, whereas only 7% reported mixed effects, negative effects, or no change.

In 24 cases, a positive effect on student achievement was not declared but reasonably could be inferred from statements about behavioral changes. One study, for example, reported "[using] results from test to improve coursework." If coursework was indeed improved, one might reasonably assume that student achievement benefited as well. Whether the "positive, inferred" studies are included in the summary or not, however, the counts indicate that far more studies find positive than mixed, negative, or no effects. The main results of this research summary are displayed in Table 10.

TABLE 10
The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect     Number of Studies    Percent of Studies    Percent w/o Inferred
Positive                      204                   84                     93
Positive, Inferred             24                   10
No Change                       8                    3                      4
Mixed                           5                    2                      2
Negative                        3                    1                      1
Total                         244                  100                    100

DISCUSSION

One hundred years' evidence suggests that testing increases achievement. Effects are moderately to strongly positive for quantitative studies, and are very strongly positive for survey and qualitative studies.

The overwhelmingly positive results of the qualitative research review, in particular, may surprise some readers. The results should be considered reliable because the base of evidence is so large: 245 studies, conducted over the course of a century, in more than 30 countries.

Qualitative studies have been held up by testing opponents as a higher standard for studies of educational impact. Labeling quantitative studies as narrow, limited, or biased, they have pointed toward qualitative studies for an allegedly broader view. But the qualitative studies they cite have not investigated the effect of tests on student achievement. Rather, they have focused on popular complaints about tests, such as "teaching to the test," "narrowing the curriculum," and the like. Indeed, many of those studies have not considered any possible positive effects and looked only for effects that would be considered negative.

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such a belief should be difficult to maintain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but, rather, are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So, can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies; expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or, one might admit that other quantitative studies cannot possibly capture all the factors that matter, and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.
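
As background for note 4, the standardized mean difference underlying such calculations is presumably the conventional one (in LaTeX notation, with treatment and control group means and a pooled standard deviation):

    d = \frac{\bar{X}_T - \bar{X}_C}{s_{\mathrm{pooled}}}, \qquad
    s_{\mathrm{pooled}} = \sqrt{\frac{(n_T - 1) s_T^2 + (n_C - 1) s_C^2}{n_T + n_C - 2}}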

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.
Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manual: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.
King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.
King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm
Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.
Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.
Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 15: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 35

TABLE 8Source Sponsor Study Design and Scale in Quantitative Studies

Number of Studies Mean Effect Size

Source of TestResearcher or Teacher 87 093National 24 087Commercial 38 082State or District 11 072Total 160

Sponsor of TestInternational 5 102Local 99 093National 45 081State 11 064Total 160

Study DesignPre-post 12 097Experiment Quasi-experiment 107 094Multivariate 26 080Experiment Posttest only 7 060Pre-post (with shadow test) 8 058Total 160

Scale of AnalysisAggregated 9 160Small-scale 118 091Large-scale 33 057Total 160

Scale of Test AdministrationClassroom 115 095Mid-scale 6 072Large-scale 39 071Total 160

Mean effect sizes can be compared between other pairs of moderators too (seeTable 8) For example small population studies such as experiments produce anaverage effect size of (091) whereas large population studies such as those usingnational populations produce an average effect size of (057) Studies of small-scale test administrations (092) produce a larger mean effect size than do studies oflarge-scale test administrations (078) Studies of local test administrations producea larger mean effect size (093) than do studies of state test administrations (064)

Studies conducted at universities (which tend to be small-scale) produce a largermean effect size (102) than do studies conducted at the elementary-secondarylevel (076) (which are often large-scale) Studies in which the intervention occurs

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 16: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

36 PHELPS

before the outcome measurement produce a larger mean effect size (093) thando ldquobackwashrdquo studies for which the intervention occurs afterwards (040) (egwhen an increase in lower-secondary test scores is observed after the introductionof a high-stakes upper-secondary examination) Studies in which the subject mattercontent of the intervention and outcome are aligned produce a larger mean effectsize (091) than do studies in which they are not aligned (078)

The quantitative studies vary dramatically in population size (from just 16 to11 million) and their effect sizes tend to correlate negatively The mean for thesmallest quintile among the full 640 effects for example is 104 whereas thatfor the largest quintile is 030 Starting with the adjusted overall mean effect sizeof 071 (for all 640 effects) the mean grows steadily larger as the studies withthe largest populations are removed For all studies with populations less than100000 the mean effect size is 073 under 10000 080 under 1000 083 under500 88 under 100 92 and under 50 106

Generally weighting reduces mean effect sizes further suggesting that thelarger the study population the weaker the results For example the unweightedeffect size of 030 for the largest study population quintile compares to its weightedcounterpart of 028 For the smallest study population quintile the effect size is104 unweighted but only 079 weighted

Survey Studies

Extracting meaning from the 800+ separate survey items required condensationFirst survey items were separated into two overarching groups in which theperception of a testing effect is either explicit or inferred Within the explicitor directly observed category survey items were further classified into two itemtypes those related to improving learning or to improving instruction

Among the 800+ individual survey items analyzed few were worded exactlylike others Often the meaning of two different questions or prompts is verysimilar however even though the wording varies Consider a question posed toUS teachers in the state of North Carolina in 2002 ldquoHave students learned morebecause of the [testing program]rdquo Eighty-three percent responded favorably and12 unfavorably while 5 were neutral (eg no opinion donrsquot know) Thissurvey item is classified in the ldquoimprove learningrdquo item group

Now consider a question posed to teachers in the Canadian Province of Albertain 1990 ldquoThe diploma examinations have positively affected the way in which Iteachrdquo Forty-one percent responded favorably 32 unfavorably and the remain-ing were neutral (or unresponsive) This survey item is classified in the ldquoimproveinstructionrdquo item group

Second the many survey items for which the respondentsrsquo perception of a test-ing effect was inferred were separated into several item groups including ldquoFavorgrade promotion examsrdquo ldquoFavor graduation examsrdquo ldquoFavor teacher competency

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is to notice that all of them produce robustly large effect sizes, and they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal intervention would not choose among interventions but rather use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations


for this. First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114–117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some, and probably most, educational testing must be of large scale. So, can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies: expensive, highly controlled and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes


as evidence that the public exaggerates the positive effect of testing. Or one might admit that other quantitative studies cannot possibly capture all the factors that matter and so might underestimate the positive effect of testing.

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious; "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.
2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.
3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective schools or curricular alignment studies that did not isolate testing's effect, teacher effectiveness studies, and conceptual or mathematical models lacking empirical evidence.
4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.
5. One benefit of adjusting effect sizes for measurement uncertainty is, at least in relative terms, to reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post.
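Notes 4 and 5 can be restated compactly. The effect sizes of Note 4 are standardized mean differences of the familiar Cohen's d form (Cohen, 1988; Hedges, 1981), and one standard version of the measurement-uncertainty adjustment mentioned in Note 5 is the correction for attenuation (Hunter & Schmidt, 2004), which divides an observed effect size by the square root of the outcome measure's reliability. The formulas below are a generic sketch of those two computations, not a restatement of every variant used in the studies reviewed:

$$d = \frac{\bar{X}_{\mathrm{treatment}} - \bar{X}_{\mathrm{control}}}{s_{\mathrm{pooled}}}, \qquad d_{\mathrm{corrected}} = \frac{d_{\mathrm{observed}}}{\sqrt{r_{yy}}}$$

where $r_{yy}$ denotes the reliability (e.g., test-retest) of the outcome measure.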

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201–210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163–178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.


Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56–62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118–134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219–246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89–146). Washington, DC: American Psychological Association.

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 17: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 37

examsrdquo ldquoFavor teacher recertification examsrdquo ldquoFavor more testingrdquo Hold schoolsaccountable for studentsrsquo scoresrdquo and ldquoHold teachers accountable for studentsrsquoscoresrdquo

A perception of a testing effect is inferred when an expression of support for atesting program derives from a belief that the program is beneficial Conversely abelief that a testing program is harmful is inferred from an expression of opposition

For the most basic effects of testing (eg improves learning or instruction) andthe most basic uses of tests (eg graduation certification) effect sizes are largein many cases more than 10 and in some cases more than 20 Effect sizes areweaker for situations in which one group is held accountable for the performanceof anothermdashholding either teachers or schools accountable for student scores

The results vary at most negligibly with study level of rigor or test stakesAnd the results are overwhelmingly positive whether the survey item focusedon testingrsquos affect on improving learning or improving instruction Some of thestudy results can be found in the following tables Table 9 lists the unweightedand weighted effect sizes by item group for the total sample respondent popula-tion With the exception of the item group ldquohold teachers accountable for studentscoresrdquo effect sizes are positive for all item groups The most popular item groupsappear to be ldquoFavor teacher recertification examsrdquo ldquoFavor teacher competency ex-amsrdquo ldquoFavor (student) graduation examsrdquo and ldquoFavor (student) grade promotionexamsrdquo

Effect sizesmdashweighted or unweightedmdashfor these four item groups are all pos-itive and large (ie gt 08) (Cohen 1988) The effect sizes for other item groupssuch as ldquoimprove learningrdquo and ldquoimprove instructionrdquo also are positive and largewhether weighted or unweighted The remaining two item groupsmdashldquofavor moretestingrdquo and ldquohold schools accountable for (student) scoresrdquomdashgarner moderatelylarge positive effect sizes whether weighted or unweighted

Table 8 also breaks out the effect size results by two different groups of re-spondents education providers and consumers Education providers comprise pri-marily administrators and teachers whereas education consumers comprise mostlystudents and noneducator adults

As groups providers and consumers tend to respond similarlymdashstrongly andpositivelymdashto items about test effects (tests ldquoimprove learningrdquo and ldquoimproveinstructionrdquo) Likewise both groups strongly favor high-stakes testing for students(ldquofavor grade promotion examsrdquo and ldquofavor graduation examsrdquo)

Provider and consumer groups differ in their responses in the other item groupshowever For example consumers are overwhelmingly in favor of teacher testingboth competency (ie certification) and recertification exams with weighted effectsizes of 22 and 20 respectively Provider responses are positive but not nearly asstrong with weighted effect sizes of 03 and 04 respectively

Clear differences of opinion between provider and consumer groups can beseen among the remaining item groups too Consumers tend to favor more testing

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 18: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

TAB

LE9

Effe

ctS

izes

ofS

urve

yR

espo

nses

By

Item

Gro

upF

orTo

talP

opul

atio

nE

duca

tion

Pro

vide

rsa

ndE

duca

tion

Con

sum

ers

1958

ndash200

81

Tota

lE

duca

tion

Pro

vide

rs2

Edu

cati

onC

onsu

mer

s3

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofE

ffec

tSiz

eE

ffec

tSiz

eN

umbe

rof

Eff

ectS

ize

Eff

ectS

ize

Num

ber

ofIt

emTy

pe(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s(U

nwei

ght)

(Wei

ght)

Item

s

Test

ing

Impr

oves

Lea

rnin

g1

30

524

21

20

514

41

30

698

Test

ing

Impr

oves

Inst

ruct

ion

10

07

165

10

06

125

12

10

40Te

stin

gA

lign

sIn

stru

ctio

n0

60

028

Favo

rG

rade

Pro

mot

ion

Exa

ms

13

11

811

21

026

13

11

53Fa

vor

Gra

duat

ion

Exa

ms

14

11

121

12

08

381

41

281

Favo

rM

ore

Test

ing

05

07

72minus0

20

119

07

08

53Fa

vor

Teac

her

Com

pete

ncy

Exa

ms

21

08

611

10

317

25

22

44Fa

vor

Teac

her

Rec

erti

fica

tion

Exa

ms

21

15

222

52

018

Hol

dTe

ache

rsA

ccou

ntab

lefo

rS

core

sminus0

10

031

minus06

minus05

180

60

513

Hol

dS

choo

lsA

ccou

ntab

lefo

rS

core

s0

60

511

10

2minus0

338

09

07

73

Not

es1

Bla

nkce

lls

repr

esen

tnum

bers

ofit

ems

too

smal

lto

repo

rtva

lidl

y2

Edu

cati

onpr

ovid

ers

are

educ

atio

nad

min

istr

ator

spr

inci

pals

tea

cher

sed

ucat

ion

agen

cyof

fici

als

and

univ

ersi

tyfa

cult

yin

educ

atio

npr

ogra

ms

3E

duca

tion

cons

umer

sar

est

uden

tsp

aren

tst

hege

nera

lpu

blic

em

ploy

ers

publ

icof

fici

als

not

wor

king

ined

ucat

ion

agen

cies

and

univ

ersi

tyfa

cult

you

tsid

eof

educ

atio

n

38

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 19: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

THE EFFECT OF TESTING ON ACHIEVEMENT 39

TABLE 10The Effect of Testing on Student Achievement in Qualitative Studies

Direction of Effect Number of Studies Percent of Studies Percent wo Inferred

Positive 204 84 93Positive Inferred 24 10No Change 8 3 4Mixed 5 2 2Negative 3 1 1Total 244 100 100

and holding teachers and schools accountable for student test scores Providerseither are less supportive in these item groups or are opposed For exampleprovidersrsquo weighted effect size for ldquohold teachers accountable for students scoresrdquois a moderately large ndash05

Qualitative Studies

Analysis of the qualitative studies focuses on the direction of the testing effecton achievement or instruction Ninety-three percent of the qualitative studiesanalyzed reported positive effects of testing whereas only 7 reported mixedeffects negative effects or no change

In 24 cases a positive effect on student achievement was not declared butreasonably could be inferred from statements about behavioral changes Onestudy for example reported ldquo[using] results from test to improve courseworkrdquoIf coursework was indeed improved one might reasonably assume that studentachievement benefited as well Whether the ldquopositive inferredrdquo studies are includedin the summary or not however the counts indicate that far more studies findpositive than mixed negative or no effects The main results of this researchsummary are displayed in Table 10

DISCUSSION

One hundred yearsrsquo evidence suggests that testing increases achievement Effectsare moderately to strongly positive for quantitative studies and are very stronglypositive for survey and qualitative studies

The overwhelmingly positive results of the qualitative research review in partic-ular may surprise some readers The results should be considered reliable becausethe base of evidence is so largemdash245 studies conducted over the course of acentury in more than 30 countries

Qualitative studies have been held up by testing opponents as a higher standardfor studies of educational impact Labeling quantitative studies as narrow limited

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressedfor the inclusion of ldquoconsequential validityrdquo measures in testing and measurementstandards perhaps believing that such would reliably show testing to be on balancenegative Results here show that such beliefs are unwarranted

Other researchers have asserted that there exists no evidence of testing ben-efits using any methodology or at least no evidence for high-stakes educationaltesting (Phelps 2005) The ldquopaucity of researchrdquo belief has spread widely amongresearchers of all types and ideological persuasions

Such should be difficult to believe after this review of the research howeverNot all the studies reviewed here found testing effects and not all of the effectsfound have been positive But studies finding positive effects on achievement existin robust number greatly outnumber those finding negative effects and date backa hundred years

Picking Winners

There may be a temptation for education program planners to pick out an inter-vention producing the ldquotoprdquo effect size and focus all resources and attention on itAs noted earlier studies in which the treatment group was made aware of their testperformance or course progress (and the control group was not) produced a highermean effect size than did those in which the treatment group was remediated (orreceived some other type of targeted instruction) Those studies in turn produceda higher mean effect size than did those in which the treatment group was testedwith higher stakes which in turn produced a higher mean effect size than didthose studies in which the treatment group was tested more frequently than wasthe control group

But more important than ranking these treatments is to notice that all of themproduce robustly large effect sizes and they are not mutually exclusive They can beused complementarily and often are This suggests that the optimal interventionwould not choose among interventions but rather use as many of them as ispractical simultaneously

Varying Results by Scale

Among the quantitative studies those conducted on a smaller-scale tend to producestronger effects than do large-scale studies There are several possible explanations

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 41

for this First larger studies often contain more ldquonoiserdquomdashinformation that maynot be clearly relevant to testing or achievement that muddles the picture Secondunlike designed experiments larger studies are usually not designed to test thestudy hypothesis but rather are happenstance accounts of events with their ownschedule and purpose Third the input and outcome measures in larger studiesoften are not aligned Take for example the common practice of using a single-subject monitoring test (eg in mathematics or English) to measure the effect ofan accountability standard that employs a full battery graduation examination andcourse distribution and completion requirements

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused clear andprecise evidence Some researchers for example have claimed that no studies ofldquotest-based accountabilityrdquo had been conducted before theirs in the 2000s (Phelps2009 pp 114ndash117) This summary includes 24 studies completed before 2000whose primary focus was to measure the effect of ldquotest-based accountabilityrdquo Afew dozen more pre-2000 studies also measured the effect of test-based account-ability although such was not their primary focus Include qualitative studies andprogram evaluation surveys of test-based accountability and the count of pre-2000studies rises into the hundreds

Still the issue of scale may present a conundrum for educational planners Bynecessity some and probably most educational testing must be of large scale Socan planners incorporate the ldquosmall-scalerdquo benefits of testingmdashsuch as masterytesting targeted instruction other types of feedback and perhaps even targetedrewards and sanctionsmdashinto large-scale test administrations Thankfully someintrepid scholars are working on that very issue (see for example Leighton20082009)

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in studentachievement due to testing they are certainly a good source of evidence of re-spondentsrsquo preferences And in democratic societies preferences matter Indeedeven if other quantitative studies reported negative effects on student achievementfrom testing political leaders could choose to follow the publicrsquos preference fortesting anyway Indeed public opinion polls sometimes represent the best avail-able method for learning the publicrsquos wishes in democracies Expensive highlycontrolled and monitored political elections do not always attract a representativesample of the citizenry (Keeter 2008)

The effect sizes for survey items on the most basic effects of testing (improveslearning or instruction) and the most basic uses of tests (grade promotion gradua-tion certification and recertification) are very large larger than those from otherquantitative studies A skeptic might interpret this difference in mean effect sizes

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

42 PHELPS

as evidence that the public exaggerates the positive effect of testing Or one mightadmit that other quantitative studies cannot possibly capture all the factors thatmatter and so might underestimate the positive effect of testing

The ldquobottom linerdquo result of the review of survey studies is very strong supportat least in the United States and Canada for the most basic forms of testing (thathold educators and students accountable for their own performance) regardlessthe respondent group or stakes Moreover the results of this study might beconsidered reliable because the base of evidence is massivemdashalmost three-quartersof a million individual responses

NOTES

1 Testing critics often characterize alignment as deleteriousmdashldquonarrowing the curriculumrdquoand ldquoteaching to the testrdquo are commonly heard phrases But when the content domains of atest match a jurisdictionrsquos required content standards aligning a course of study to the testis eminently responsible behavior2 In 2007 the United Statesrsquo population represented about 73 of the population ofall OECD countries whose primary language is English (ie Australia Canada [primarilyEnglish-speaking population only] Ireland New Zealand United Kingdom United States)Source OECD Factbook 20103 Of the documents set aside a few hundred provide evidence of achievement effects ina manner excluded from this study This ldquooff focusrdquo research includes studies of non-testincentive programs (eg awarding prizes to students or teachers for achievement gains)ldquoopt-outrdquo tests (eg passing a test allows a student to skip a required course thus saving timeand money) benefit-cost analyses effective schools or curricular alignment studies that didnot isolate testingrsquos effect teacher effectiveness studies and conceptual or mathematicalmodels lacking empirical evidence4 The actual effect size was calculated for example comparing two groupsrsquo posttest scores(using pretests or other covariates if assignment was not random) two groupsrsquo post-pre-testgain scores the same grouprsquos posttest or gain scores on tests of two different types or lessfrequently by retrospective matching and pre-post comparison5 One benefit of adjusting effect sizes for measurement uncertainty is to at least in relativeterms reduce any ldquotest-retestrdquo reliability effect on the effect size However only a tinyminority of the experimental studies and a small minority of all studies used the same testpre and post

REFERENCES

Bertsch S Pesta B J Wiscott R amp McDaniel M A (2007) The generation effect A meta-analyticreview Memory amp Cognition 35(2) 201ndash210

Cohen J (1988) Statistical power analysis for the behavioral sciences (Rev Ed) New York NYAcademic Press

Crabtree B F amp Miller W L (1999) Using codes and code manual A template organizing styleof interpretation In B F Crabtree amp W L Miller (Eds) Doing qualitative research (2nd edpp 163ndash178) Thousand Oaks CA Sage

Hattie J A C (2009) Visible learning A synthesis of over 800 meta-analyses relating to achievementLondon UK Routledge

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

THE EFFECT OF TESTING ON ACHIEVEMENT 43

Hedges L V (1981) Distribution theory for Glassrsquos estimator of effect size and related estimatorsJournal of Educational Statistics 6 107ndash128

Hunter J E amp Schmidt F L (2004) Methods of meta-analysis Correcting error and bias in researchfindings Thousand Oaks CA Sage

Keeter S (2008 Autumn) Poll power The Wilson Quarterly 56ndash62King N (1998) Template analysis In G Symon amp C Cassell (Eds) Qualitative methods and analysis

in organizational research A practical guide (pp 118ndash134) London UK SageKing N (2006) What is template analysis University of Huddersfield School of Human and Health

Sciences Retrieved from httpwww2hudacukhhsresearchtemplate analysisindexhtmLeighton J P (20082009) Mistaken impressions of large-scale cognitive diagnostic testing In R

P Phelps (Ed) Correcting fallacies about educational and psychological testing (pp 219ndash246)Washington DC American Psychological Association

Lipsey M W amp Wilson D B (2001) Practical meta-analysis Thousand Oaks CA SageMiles M B amp Huberman A M (1994) Qualitative data analysis An expanded sourcebook Beverly

Hills CA SagePhelps R P (2005) The rich robust research literature on testingrsquos achievement benefits In R P

Phelps (Ed) Defending standardized testing (pp 55ndash90) Mahwah NJ Psychology PressPhelps R P (2009) Education achievement testing Critiques and rebuttals In R P Phelps (Ed)

Correcting fallacies about educational and psychological testing (pp 89ndash146) Washington DCAmerican Psychological Association

Dow

nloa

ded

by [

Uni

vers

ity o

f M

inne

sota

Lib

rari

es T

win

Citi

es]

at 1

849

16

Janu

ary

2013

Page 20: The Effect of Testing on Student Achievement, 1910–2010edmeasurement.net/MAG/Phelps 2012 effects of testing.pdf · THE EFFECT OF TESTING ON ACHIEVEMENT 25 have been tested more

40 PHELPS

or biased they have pointed toward qualitative studies for an allegedly broaderview But the qualitative studies they cite have not investigated the effect of testson student achievement Rather they have focused on popular complaints abouttests such as ldquoteaching to the testrdquo ldquonarrowing the curriculumrdquo and the likeIndeed many of those studies have not considered any possible positive effectsand looked only for effects that would be considered negative

Some believers in the superiority of qualitative research have also pressed for the inclusion of "consequential validity" measures in testing and measurement standards, perhaps believing that such measures would reliably show testing to be, on balance, negative. Results here show that such beliefs are unwarranted.

Other researchers have asserted that there exists no evidence of testing benefits using any methodology, or at least no evidence for high-stakes educational testing (Phelps, 2005). The "paucity of research" belief has spread widely among researchers of all types and ideological persuasions.

Such a belief should be difficult to sustain after this review of the research, however. Not all the studies reviewed here found testing effects, and not all of the effects found have been positive. But studies finding positive effects on achievement exist in robust number, greatly outnumber those finding negative effects, and date back a hundred years.

Picking Winners

There may be a temptation for education program planners to pick out an intervention producing the "top" effect size and focus all resources and attention on it. As noted earlier, studies in which the treatment group was made aware of their test performance or course progress (and the control group was not) produced a higher mean effect size than did those in which the treatment group was remediated (or received some other type of targeted instruction). Those studies, in turn, produced a higher mean effect size than did those in which the treatment group was tested with higher stakes, which in turn produced a higher mean effect size than did those studies in which the treatment group was tested more frequently than was the control group.

But more important than ranking these treatments is noticing that all of them produce robustly large effect sizes and that they are not mutually exclusive. They can be used complementarily, and often are. This suggests that the optimal program would not choose among these interventions but would instead use as many of them as is practical, simultaneously.

Varying Results by Scale

Among the quantitative studies, those conducted on a smaller scale tend to produce stronger effects than do large-scale studies. There are several possible explanations for this.

First, larger studies often contain more "noise": information that may not be clearly relevant to testing or achievement and that muddles the picture. Second, unlike designed experiments, larger studies are usually not designed to test the study hypothesis but rather are happenstance accounts of events with their own schedule and purpose. Third, the input and outcome measures in larger studies often are not aligned. Take, for example, the common practice of using a single-subject monitoring test (e.g., in mathematics or English) to measure the effect of an accountability standard that employs a full-battery graduation examination and course distribution and completion requirements.

Those who judge the effect of testing on achievement exclusively from large-sample multivariate studies deprive themselves of the most focused, clear, and precise evidence. Some researchers, for example, have claimed that no studies of "test-based accountability" had been conducted before theirs in the 2000s (Phelps, 2009, pp. 114-117). This summary includes 24 studies completed before 2000 whose primary focus was to measure the effect of "test-based accountability." A few dozen more pre-2000 studies also measured the effect of test-based accountability, although such was not their primary focus. Include qualitative studies and program evaluation surveys of test-based accountability, and the count of pre-2000 studies rises into the hundreds.

Still, the issue of scale may present a conundrum for educational planners. By necessity, some and probably most educational testing must be of large scale. So can planners incorporate the "small-scale" benefits of testing, such as mastery testing, targeted instruction, other types of feedback, and perhaps even targeted rewards and sanctions, into large-scale test administrations? Thankfully, some intrepid scholars are working on that very issue (see, for example, Leighton, 2008/2009).

The Importance of Surveys

Even if survey data are not the best source of evidence for actual changes in student achievement due to testing, they are certainly a good source of evidence of respondents' preferences. And in democratic societies, preferences matter. Even if other quantitative studies reported negative effects on student achievement from testing, political leaders could choose to follow the public's preference for testing anyway. Indeed, public opinion polls sometimes represent the best available method for learning the public's wishes in democracies: expensive, highly controlled, and monitored political elections do not always attract a representative sample of the citizenry (Keeter, 2008).

The effect sizes for survey items on the most basic effects of testing (improves learning or instruction) and the most basic uses of tests (grade promotion, graduation, certification, and recertification) are very large, larger than those from other quantitative studies. A skeptic might interpret this difference in mean effect sizes as evidence that the public exaggerates the positive effect of testing. Or one might admit that other quantitative studies cannot possibly capture all the factors that matter and so might underestimate the positive effect of testing.
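
To make that metric concrete: this section does not specify which transformation maps survey responses onto the effect-size scale, but one standard convention in the meta-analysis literature (e.g., Lipsey & Wilson, 2001) is the arcsine difference between proportions. The figures below are purely illustrative, not drawn from the studies reviewed:

h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}, \qquad \text{e.g.,}\quad 2\arcsin\sqrt{0.93} - 2\arcsin\sqrt{0.50} \approx 2.605 - 1.571 \approx 1.03,

where p_1 is the proportion of respondents endorsing an item (here, a hypothetical 93% agreeing that testing improves learning) and p_2 = 0.50 is an evenly split benchmark. On this scale, proportions far from an even split translate into effect sizes above 1.0.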

The "bottom line" result of the review of survey studies is very strong support, at least in the United States and Canada, for the most basic forms of testing (those that hold educators and students accountable for their own performance), regardless of the respondent group or stakes. Moreover, the results of this study might be considered reliable because the base of evidence is massive: almost three-quarters of a million individual responses.

NOTES

1. Testing critics often characterize alignment as deleterious: "narrowing the curriculum" and "teaching to the test" are commonly heard phrases. But when the content domains of a test match a jurisdiction's required content standards, aligning a course of study to the test is eminently responsible behavior.

2. In 2007, the United States' population represented about 73% of the population of all OECD countries whose primary language is English (i.e., Australia, Canada [primarily English-speaking population only], Ireland, New Zealand, United Kingdom, United States). Source: OECD Factbook 2010.

3. Of the documents set aside, a few hundred provide evidence of achievement effects in a manner excluded from this study. This "off focus" research includes studies of non-test incentive programs (e.g., awarding prizes to students or teachers for achievement gains), "opt-out" tests (e.g., passing a test allows a student to skip a required course, thus saving time and money), benefit-cost analyses, effective-schools or curricular-alignment studies that did not isolate testing's effect, teacher-effectiveness studies, and conceptual or mathematical models lacking empirical evidence.

4. The actual effect size was calculated, for example, by comparing two groups' posttest scores (using pretests or other covariates if assignment was not random), two groups' post-pre-test gain scores, or the same group's posttest or gain scores on tests of two different types, or, less frequently, by retrospective matching and pre-post comparison.

5. One benefit of adjusting effect sizes for measurement uncertainty is to, at least in relative terms, reduce any "test-retest" reliability effect on the effect size. However, only a tiny minority of the experimental studies, and a small minority of all studies, used the same test pre and post. (A worked sketch of the computations described in notes 4 and 5 follows these notes.)
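
As a minimal, hypothetical sketch of the steps compressed into notes 4 and 5 (the data values, reliability figure, and function names below are this illustration's own, not drawn from any study reviewed), the following Python fragment computes a standardized mean difference from two groups' posttest summaries, applies the Hedges (1981) small-sample correction, and then applies the Hunter and Schmidt (2004) correction for unreliability in the outcome measure:

import math

def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def cohens_d(m_t: float, s_t: float, n_t: int, m_c: float, s_c: float, n_c: int) -> float:
    """Standardized mean difference between treatment and control posttests."""
    return (m_t - m_c) / pooled_sd(s_t, n_t, s_c, n_c)

def hedges_g(d: float, n_total: int) -> float:
    """Hedges (1981) small-sample bias correction applied to d."""
    return d * (1 - 3 / (4 * n_total - 9))

def disattenuate(d: float, r_yy: float) -> float:
    """Hunter & Schmidt (2004) correction for unreliability of the outcome measure."""
    return d / math.sqrt(r_yy)

# Hypothetical posttest summaries: a tested-with-feedback group vs. a control group.
d = cohens_d(m_t=78.0, s_t=10.0, n_t=30, m_c=72.0, s_c=11.0, n_c=30)
g = hedges_g(d, n_total=60)
g_adj = disattenuate(g, r_yy=0.80)  # outcome reliability assumed to be .80
print(f"d = {d:.2f}, g = {g:.2f}, adjusted g = {g_adj:.2f}")  # d = 0.57, g = 0.56, adjusted g = 0.63

Because the unreliability correction divides by the square root of the reliability, any reliability below 1.0 raises the estimate; corrected effect sizes are therefore always at least as large as uncorrected ones.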

REFERENCES

Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35(2), 201-210.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Rev. ed.). New York, NY: Academic Press.

Crabtree, B. F., & Miller, W. L. (1999). Using codes and code manuals: A template organizing style of interpretation. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 163-178). Thousand Oaks, CA: Sage.

Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Keeter, S. (2008, Autumn). Poll power. The Wilson Quarterly, 56-62.

King, N. (1998). Template analysis. In G. Symon & C. Cassell (Eds.), Qualitative methods and analysis in organizational research: A practical guide (pp. 118-134). London, UK: Sage.

King, N. (2006). What is template analysis? University of Huddersfield, School of Human and Health Sciences. Retrieved from http://www2.hud.ac.uk/hhs/research/template_analysis/index.htm

Leighton, J. P. (2008/2009). Mistaken impressions of large-scale cognitive diagnostic testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 219-246). Washington, DC: American Psychological Association.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Beverly Hills, CA: Sage.

Phelps, R. P. (2005). The rich, robust research literature on testing's achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55-90). Mahwah, NJ: Psychology Press.

Phelps, R. P. (2009). Education achievement testing: Critiques and rebuttals. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 89-146). Washington, DC: American Psychological Association.
