

Learning and Instruction 22 (2012) 271–280

Contents lists available at SciVerse ScienceDirect

Learning and Instruction

journal homepage: www.elsevier.com/locate/learninstruc

Overconfidence produces underachievement: Inaccurate self evaluations undermine students’ learning and retention

John Dunlosky*, Katherine A. Rawson

Kent State University, Psychology Department, Kent, OH 44242, USA

Article info

Article history:
Received 2 February 2011
Received in revised form 19 August 2011
Accepted 21 August 2011

Keywords:
Judgment accuracy
Retention
Metacognition
Self regulation
Metacomprehension

* Corresponding author. Tel.: +1 330 672 2207; fax: +1 330 672 3786. E-mail address: [email protected] (J. Dunlosky).

0959-4752/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.learninstruc.2011.08.003

Abstract

The function of accurately monitoring one’s own learning is to support effective control of study that enhances learning. Although this link between monitoring accuracy and learning is intuitively plausible and is assumed by general theories of self-regulated learning, it has not received a great deal of empirical scrutiny and no study to date has examined the link between monitoring accuracy and longer-term retention. Across two studies, college students paced their study of key-term definitions (e.g., “Proactive interference: Information already stored in memory interferes with the learning of new information”). After all definitions were studied, participants completed practice cued recall tests (e.g., “What is proactive interference?”) in which they attempted to type the correct definition for each term. After each test trial, participants judged how much of their response was correct. These study-test-judgment trials continued until a definition was judged as correct three times. A final cued recall test occurred two days later. In Study 1, judgment accuracy was manipulated experimentally, and in Study 2, individual differences in accuracy were examined. In both studies, greater accuracy was associated with higher levels of retention, and this link could not be explained by differential feedback, effort during study, or trials to criterion. Results indicate that many students could benefit from interventions aimed at improving their skill at judging their learning.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Given the amount of material that students must learn across a wide variety of courses, they should strive to achieve durable learning and also to use their time efficiently. To help foster efficient and durable learning, researchers have recommended the use of accurate monitoring to guide learning (e.g., Dunlosky, Hertzog, Kennedy, & Thiede, 2005; Pashler et al., 2007). One reason for this recommendation is that students who are overconfident in their evaluations of learning may fall short of their learning goals, whereas accurate evaluations of one’s own learning can be used to more effectively guide further study (Rawson & Dunlosky, in press). Although intuitive, the evidence for this link between overconfidence and achievement is limited and may exaggerate the importance of accurate monitoring for effectively achieving durable retention. Before we describe the available evidence, we briefly discuss general theories of self-regulated learning, which predict that monitoring accuracy will be related to learning and retention. We then consider previous research in relationship to the present studies, which were primarily designed to answer the following



question: Does overconfidence produce underachievement when students are learning key-term definitions?

1.1. Is better monitoring accuracy related to better acquisition and retention?

1.1.1. Theory of self-regulated learning

Metacognitive theory of self-regulated learning inextricably links monitoring to control (e.g., Boekaerts, 1997; Flavell, 1979; Koriat & Goldsmith, 1996; Metcalfe, 2009; Nelson & Narens, 1990; Winne & Hadwin, 1998). The relationship between monitoring and control was eloquently illustrated by Nelson and Narens (1990; for historical roots, see Miller, Galanter, & Pribram, 1960), who separated cognitive processes into two interrelated levels, called the meta-level and the object-level. Nelson and Narens (1990) proposed

two dominance relations, called ‘control’ and ‘monitoring’, which are defined in terms of the direction of the flow of information between the meta-level and the object-level. The basic notion underlying control—analogous to speaking into a telephone handset—is that the meta-level modifies the object-level, [and] the basic notion underlying monitoring—analogous


to listening to the telephone handset—is that the meta-level is informed by the object-level (pp. 126-127; italics in original).

In this model, the output from control processes (from the object-level) informs monitoring (cf. Koriat, Ma’ayan, & Nussinson, 2006), and most important here, monitoring in turn influences control decisions.

Without this latter link, in which a learner’s assessments of on-going learning are used to make decisions about what and how to study, metacognitive monitoring would be inert—an epiphenomenon perhaps worthy of discussion by philosophers of the mind but not relevant to understanding how human thought and action is controlled (Lieberman, 1979; Nelson, 1996). Fortunately, correlational evidence (e.g., Hines, Touron, & Hertzog, 2009) and experimental evidence (e.g., Metcalfe & Finn, 2008; Thiede & Dunlosky, 1999) converge on the conclusion that learners’ monitoring does influence control decisions about what to study. Whereas the influence of monitoring on the control of study has been firmly established, the present studies were designed to evaluate the under-explored question of whether the accuracy of monitoring is related to better retention. It is critical to not conflate monitoring with monitoring accuracy. That is, many studies have demonstrated that learners of all ages and abilities use monitoring to guide their study (for reviews, see Dunlosky & Ariel, 2011; Son & Metcalfe, 2000), but the key issue here is whether the accuracy of learners’ monitoring is related to better acquisition and retention.

Because learners use monitoring to efficiently obtain their learning goals (for reviews, see Benjamin, 2007; Castel, 2007; Dunlosky & Ariel, 2011), better monitoring accuracy is expected to be related to more effective learning and higher levels of retention. Although this accuracy-influences-memory (AIM) hypothesis appears relatively straightforward, it relies on two auxiliary assumptions. First, it assumes that monitoring is used to control learning in a relatively effective manner (Thiede, 1999), because even if learners’ monitoring were perfectly accurate, if they did not (or could not) appropriately use their monitoring to control learning across items (Son & Sethi, 2006), then better accuracy will not benefit retention (Nelson, Dunlosky, Graf, & Narens, 1994). The second assumption is a corollary of the first; namely, an association between monitoring accuracy and learning may not be observed if learners differ in how they use monitoring to control study time—for example, a learner who monitors accurately but uses this monitoring inappropriately to control study may be outperformed by a learner with somewhat lower monitoring accuracy who makes more effective control decisions (Thiede, 1999). The relevance of these assumptions will be highlighted in the following review.

1.1.2. Evidence relevant to the AIM hypothesis

Only a few studies have investigated the link between monitoring accuracy and memory (Begg, Martin, & Needham, 1992; Bisanz, Vesonder, & Voss, 1978; Nietfeld, Cao, & Osborne, 2006; Thiede, 1999; Thiede, Anderson, & Therriault, 2003). Begg et al. (1992) had college students study word pairs (e.g., railroad – mother), and this study trial was followed by a delayed judgment of learning (JOL) in which the students predicted the chances of remembering the pair. These JOLs were prompted with either the cue only (railroad – ?) or the cue-target pair (railroad – mother); cue-only prompts lead to substantially greater levels of monitoring accuracy. After making a JOL, the entire pair was presented for restudy for 4 s, and after all pairs had been restudied, a criterion test of paired-associate recall occurred. Despite the greater accuracy of cue-only JOLs, test performance was no different when participants made cue-only JOLs versus cue-target JOLs. This evidence led Begg et al. (1992) to conclude that “memory monitoring does not make a valuable contribution to memory” (p. 212). Note, however, that

the first auxiliary assumption introduced above was not met: Because the restudy trials were brief and experimenter paced, participants could not use monitoring to effectively control learning. Similarly, Rhodes and Tauber (2011) conducted a meta-analysis on papers examining judgment accuracy, which demonstrated that conditions supporting higher levels of judgment accuracy were not related to greater memory performance. Most relevant here, the studies in their meta-analysis did not employ methods that allowed participants to use their judgments to control study (in fact, judgments were made after studying was completed), so these studies also fail to meet the first auxiliary assumption and hence are not relevant to evaluating the AIM hypothesis.

To sidestep such problems, Thiede (1999) used a procedure similar to the one used by Begg et al. (1992), but with two key changes. First, accuracy was not manipulated between groups; instead, the focus was on whether individual differences in accuracy were related to better learning. Second, all participants made delayed cue-only JOLs on the first study-test trial, and during each of the next four study-judgment-test trials, participants selected those items that they wanted to restudy. Thus, participants could use their monitoring to control their learning in a (potentially) effective manner. Thiede (1999) computed the relative accuracy of the delayed JOLs on Trial 1, which involved correlating each participant’s JOLs with his or her own recall performance across items. As predicted by the AIM hypothesis, individual differences in relative accuracy on Trial 1 were positively correlated with subsequent boosts in learning across trials (i.e., participants who were better at discriminating items they knew from items they did not yet know profited more from the subsequent opportunities to select items for restudy). However, the relationship between accuracy and learning arose only on later trials and was relatively weak. One reason for the weak effect is that participants selected items for restudy, so that some benefits of better accuracy may have been offset by individual differences in how well monitoring was used to allocate study time (as per the second auxiliary assumption described above). Although investigating individual differences in how learners use monitoring to control study is important, holding control processes constant across participants can provide a more valid estimate of the influence of monitoring accuracy on retention. Accordingly, in the present studies, the same algorithm was used to control study decisions for all participants.

The present studies also critically differ from research by Thiede and his colleagues (Thiede, 1999; Thiede et al., 2003) in that they focused on relative accuracy (see also, Bisanz et al., 1978), whereas we are focusing on absolute accuracy. Relative accuracy refers to the degree to which judgments predict which items (relative to others) are more likely to be recalled on the criterion test and is measured by the aforementioned intra-individual correlations. By contrast, absolute accuracy refers to the extent to which people’s judgments demonstrate overconfidence (judgments are more optimistic than actual performance) versus underconfidence (judgments are less optimistic than actual performance; for details about other conceptualizations of absolute accuracy, see Boekaerts & Rozendaal, 2010; Keren, 1991; Schraw, 2009). Arguably, absolute accuracy is critical for obtaining efficient and durable learning. The theoretical claim here is straightforward, but it has major implications for student success. Overconfidence should lead to premature termination of study and hence yield lower levels of learning during practice, which in turn will translate to lower levels of retention on the final test. Put differently, students who are more overconfident are not expected to learn as much during practice as compared to students who are better calibrated, and these initial differences in learning are expected to persist on the final retention test.
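The distinction between relative and absolute accuracy can be made concrete with a short computational sketch. The code below is our illustration only (the function names and data are hypothetical, not taken from these studies): relative accuracy is computed as a Goodman-Kruskal gamma over item pairs for a single participant, and absolute accuracy as the mean signed difference between judgments and recall, with positive values indicating overconfidence.

```python
# Illustrative sketch: relative vs. absolute accuracy for one
# hypothetical participant. Judgments (JOLs) and recall are both
# on a 0-100 scale; each list position is one item.

def relative_accuracy(jols, recall):
    """Goodman-Kruskal gamma over item pairs: do higher judgments
    go with better recall, relative to other items?"""
    concordant = discordant = 0
    n = len(jols)
    for i in range(n):
        for j in range(i + 1, n):
            product = (jols[i] - jols[j]) * (recall[i] - recall[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    if concordant + discordant == 0:
        return 0.0
    return (concordant - discordant) / (concordant + discordant)

def absolute_bias(jols, recall):
    """Mean signed difference: positive = overconfidence,
    negative = underconfidence."""
    return sum(j - r for j, r in zip(jols, recall)) / len(jols)

# A participant can be perfectly accurate in the relative sense
# while remaining badly overconfident in the absolute sense:
jols   = [90, 70, 50, 30]   # predicted recall per item
recall = [60, 40, 20,  0]   # actual recall per item

print(relative_accuracy(jols, recall))  # 1.0 (perfect ordering)
print(absolute_bias(jols, recall))      # 30.0 (overconfident by 30 points)
```

The example shows why the two measures can dissociate: the participant’s judgments order the items perfectly, yet every judgment overshoots actual recall by 30 points.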

The relationship between absolute accuracy and performance has been investigated in two studies (Nietfeld et al., 2006; Shin,


Bjorklund, & Beck, 2007). As we describe next, these studies varied widely with regard to the participants’ ages and the tasks used, and they supported different outcomes, so their conclusions should be considered with some caution. Nietfeld et al. (2006) trained students in one class to make more accurate confidence judgments, and as compared to an untrained class, training improved the absolute accuracy of students’ confidence judgments. These improvements in absolute accuracy were also related to higher scores on the final exam. Although promising, training covaried with classroom assignment, so some uncontrolled variable could potentially account for these positive results. Shin et al. (2007) had kindergarteners, first-graders, and third-graders study 15 pictures and predict how many they would recall. The children were highly overconfident, and some were more overconfident than others. The children studied and attempted to recall different pictures for several more trials, and most important, overconfidence was related to learning performance. In this case, however, overconfidence led to less loss in performance across trials, because the most overconfident children presumably had greater task self-efficacy, worked more diligently, and used better strategies. This relationship held for the youngest children but was less evident for third graders. Based on these outcomes, Bjorklund, Periss, and Causey (2009) concluded that “at least for younger children, overestimating one’s abilities … is associated with greater gains in cognitive performance than for children who are more in touch with their cognitive abilities” (p. 130).

In sum, currently the best evidence establishing a positive relationship between accuracy and memory is available from studies that have investigated relative accuracy. Previous studies involving absolute accuracy—which is our present focus—have either not provided definitive support or have disconfirmed the AIM hypothesis, albeit with younger children whose outcomes may not generalize to older students. Moreover, no studies that support definitive outcomes have investigated longer-term retention; that is, previous research has largely explored boosts in performance during or immediately after learning. A concern here is that although better accuracy may have led to faster learning in the previous studies (e.g., Thiede, 1999), it may support less durable retention. Consistent with this possibility, some manipulations that support more rapid learning actually lead to poorer long-term retention (Schmidt & Bjork, 1992).

1.1.3. Overview of the present studies

To evaluate the AIM hypothesis, we used a method designed to overcome the limitations of previous research. In both studies, college students were presented with key term definitions for an initial study trial followed by cued-recall practice tests in which they were presented with the key term and prompted to type in the definition. After each test trial, participants evaluated the quality of their recall response by judging whether the response was entirely correct, partially correct or incorrect. Immediately after a participant scored a response, the correct definition was presented for restudy. Importantly, for reasons noted above, the decision of whether to continue practicing a given definition was computer controlled (cf. Atkinson, 1972; Hays, Kornell, & Bjork, 2011; Nelson et al., 1994), using an algorithm that is known to yield relatively good long-term retention after a single study session (Rawson & Dunlosky, 2011). In particular, an item continued to be presented for test-judgment-restudy trials until a participant judged his or her recall of the definition as correct three times. That is, a participant would need to have judged his or her recall of a definition entirely correct on three different trials, and once they judged it entirely correct on the third trial, that particular key term definition was dropped from further practice. The practice session continued until all items were dropped from practice. By using this computer

algorithm to make control decisions, this procedure met the two auxiliary assumptions (monitoring was used to effectively regulate study and all participants used the same regulation algorithm) relevant to isolating the relationship between absolute accuracy and retention. Finally, to examine longer-term retention than in prior research, we administered the final retention test two days after the practice session.

In Study 1, monitoring accuracy was manipulated by using different judgment prompts (as in Begg et al., 1992), and in Study 2, individual differences in accuracy were investigated (as in Thiede, 1999). The key issue was whether the absolute accuracy of judgments was related to performance on the final retention test. On one hand, inaccurate judgments may lead to better retention. People who are underconfident (e.g., judge that correct responses are incorrect) may benefit from overlearning, which could lead to better retention. On the other hand, inaccurate judgments may be detrimental if students judge that incorrect responses are correct, because practice may be terminated before the definitions have been learned well enough to be retained. Given that commission errors are common when students learn key-term definitions (e.g., Kikas, 1998; Rawson & Dunlosky, 2007), we expected that any benefit of underconfidence would be offset by a detrimental influence of overconfidence.

2. Study 1

In Study 1, after initial study of each definition, items were presented for test-judgment-study trials until they met a learning criterion, which was informed by participants’ judgments on a three-point scale. Namely, after participants attempted to recall a given definition, they scored their response as entirely correct, partially correct, or incorrect (for rationale, see Dunlosky, Hartwig, Rawson, & Lipko, 2011). Judgment accuracy was influenced by manipulating the kind of standard participants used when judging their learning—either an idea-unit standard (idea-unit group) or no standard of evaluation (no-standard group). In the no-standard group, the participant was shown his or her response again and made a self-score judgment without access to the correct definition. For the idea-unit group, participants were shown their response along with the idea units contained in the correct answer, and they were asked to identify which idea units were contained in their response. After making the idea-unit judgments, participants made a self-score judgment. As compared to using no standard, idea-unit standards support more accurate self-score judgments, particularly in terms of reducing overconfidence (Lipko et al., 2009). Assuming self-score judgments are more accurate in the idea-unit group than in the no-standard group, the key prediction of the AIM hypothesis is that retention will be greater for those who used idea-unit standards than for those who received no standard.

2.1. Methods

2.1.1. Participants and design

Forty-eight undergraduate students from Kent State University participated to receive course credit in Introductory Psychology. Participants were randomly assigned to one of two groups, the no-standard group (n = 26) or the idea-unit group (n = 22).

2.1.2. Materials

Materials included a short passage on social attributions adapted from Introductory Psychology textbooks. The passage included six key terms and their definitions (e.g., “The just-world hypothesis is the strong desire or need people have to believe that the world is an orderly, predictable, and just place, where people get what they deserve”). Although adding definitions may have produced more stable results, we used six definitions because pilot research indicated that students could reach criterion within a reasonable amount of time (i.e., less than an hour).

Fig. 1. Underconfidence = percent correct recall when definitions received a “no credit” response. Overconfidence = 100 minus the percent correct recall when definitions received a “full credit” response (higher values indicate more overconfidence). See text for details.

2.1.3. Procedure

Participants were instructed to do their best to learn each concept and were informed that they would later take a final test in which they would be asked to recall the definition of each key term. To familiarize participants with the to-be-learned material, participants were first presented with the passage to study for 4 min. Each of the key term definitions was then presented one at a time for an initial self-paced study trial. After the initial study phase, the practice session began in which items were presented one at a time for test-judge-restudy trials. On each trial, a key term was presented (e.g., “What is the just-world hypothesis?”), and participants were instructed to type as much of the definition as possible. Each participant then scored his or her own response. For the no-standard group, the participant’s response was presented along with the prompt for the self-score judgment: “If the quality of your response was being graded, do you think you would receive no credit, partial credit, or full credit?” Three response buttons were presented at the bottom of the screen for participants to indicate their judgment. For the idea-unit group, the participant’s response was presented along with the idea units from the correct definition. A yes/no check box was presented next to each idea unit, and the participant was instructed to indicate if the idea unit was present in his/her response (for details on the construction of idea units, see Lipko et al., 2009). After making the idea-unit judgments, participants were then presented with the self-score judgment prompt and response buttons (the participant’s response and the idea-unit judgments also remained on the screen). Both groups were instructed to make their judgments as accurately as possible. In both groups, after participants made their self-score judgment, the correct definition for that key term was presented for self-paced restudy.

The computer program used each participant’s self-score judgments to make a decision about when to terminate practice trials for each item. On a trial in which a participant did not judge a response to be correct, that item was placed at the end of the list for another test-judge-study trial. On trials in which a participant judged a response as correct, if it was the first or second time that item had been judged as correctly recalled, the item was also placed at the end of the list for another practice trial later. If it was the third time an item had been judged as correctly recalled, the item was removed and received no further practice. We chose three correct trials as the criterion because other research suggests that fewer than three correct recalls yields poorer retention but more than three produces minimal additional benefit (Rawson & Dunlosky, 2011). By using this algorithm, the two auxiliary assumptions of the AIM hypothesis were met, which provided the strongest test for evaluating whether the absolute accuracy of learners’ judgments is related to retention. Put differently, an advantage of this algorithm is that it does not require all participants to study the definitions the same amount of time or the same number of trials (as in almost all prior research).
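The scheduling rule just described can be sketched in a few lines of code. This is a minimal illustration under stated assumptions (the names are ours and `get_self_score_judgment` stands in for an entire test-judge-restudy trial), not the actual experiment software:

```python
from collections import deque

# Sketch of the computer-controlled practice schedule: each item cycles
# through test-judge-restudy trials and is dropped only after the
# participant judges his or her recall "full credit" three times.

CRITERION = 3  # correct-judgment criterion (cf. Rawson & Dunlosky, 2011)

def run_practice_session(items, get_self_score_judgment):
    """items: list of key terms. get_self_score_judgment(term) returns
    'full', 'partial', or 'no' for the participant's self-score on this
    trial. Returns the full trial log as (term, judgment) pairs."""
    queue = deque(items)
    judged_correct = {term: 0 for term in items}
    trial_log = []
    while queue:
        term = queue.popleft()
        judgment = get_self_score_judgment(term)  # test + self-score
        trial_log.append((term, judgment))
        if judgment == 'full':
            judged_correct[term] += 1
        if judged_correct[term] < CRITERION:
            queue.append(term)  # back to the end of the list
        # else: item is dropped from further practice
    return trial_log
```

Note that the total number of trials depends entirely on the participant’s judgments: a participant who judges every attempt “full credit” finishes each item in exactly three trials, whereas an underconfident participant keeps recycling items.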

After all items reached criterion during the practice session, the participants were excused and reminded to return two days later to complete the retention test. During the final test, each key term was presented one at a time, and the participants were instructed to type as much of the correct definition as possible.

2.2. Results

Each response was assigned a recall score based on the proportion of idea units from the definition contained in the response. Responses were counted as correct if they included either verbatim restatements or paraphrases that preserved the meaning of the definition.

2.2.1. Performance on the final retention test

Mean score across individual percent correct recall was computed (M across groups = 41, SD = 29.5, range = 0–100). As predicted, final retention of the concepts was higher in magnitude for the idea-unit group (M = 50, SEM = 7) than for the no-standard group (M = 33, SEM = 5), t(44) = 1.92, p < .05. According to the AIM hypothesis, this boost in retention was due to the idea-unit group obtaining better absolute accuracy (in terms of reduced overconfidence), which we examine next.

2.2.2. Absolute accuracy of the self-score judgments

Two aspects of absolute accuracy were evaluated, underconfidence and overconfidence (for details on various measures of absolute accuracy, see Keren, 1991). Measures of each can be obtained by computing the percentage of idea units recalled conditionalized on self-score judgments of “no credit” (for underconfidence) and “full credit” (for overconfidence). For items that received a self-score of “no credit,” recall scores greater than zero indicate underconfidence. Similarly, for items that received a self-score of “full credit,” recall scores less than 100 indicate overconfidence (i.e., indicating that participants judged their responses as correct when they did not correctly recall the entire definition).
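A brief sketch of this conditionalization (our illustration; the data and names are hypothetical) may help make the two measures concrete:

```python
# Recall scores (percent of idea units recalled) grouped by the
# self-score judgment given on the same trial.

def conditional_accuracy(trials):
    """trials: list of (judgment, recall_percent) pairs, where judgment
    is 'no', 'partial', or 'full' credit. Returns (underconfidence,
    overconfidence): mean recall on 'no credit' trials, and 100 minus
    mean recall on 'full credit' trials."""
    no_credit = [r for j, r in trials if j == 'no']
    full_credit = [r for j, r in trials if j == 'full']
    under = sum(no_credit) / len(no_credit) if no_credit else 0.0
    over = 100 - sum(full_credit) / len(full_credit) if full_credit else 0.0
    return under, over

trials = [('no', 20), ('no', 0), ('full', 60), ('full', 80), ('partial', 50)]
under, over = conditional_accuracy(trials)
print(under)  # 10.0: some underconfidence (recall > 0 despite "no credit")
print(over)   # 30.0: overconfidence (recall < 100 despite "full credit")
```

As in the text, “partial credit” trials are excluded from both measures, because the performance level a participant means by “partial” is unknown.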

Mean scores for responses given a “no-credit” judgment are presented in the left bars of Fig. 1. In both groups, when participants indicated that their response was incorrect, their actual performance was significantly greater than 0, ts > 2.0, demonstrating some underconfidence in judging their responses. However, the degree of underconfidence for both groups was slight, and the two groups did not significantly differ, t(32) = .24. The corresponding means for responses given a “full credit” judgment are presented in the right bars of Fig. 1. Note that we subtracted this value from 100, so that higher values (percent incorrectly recalled) indicated more overconfidence. When participants indicated that their response was entirely correct, their incorrect performance was significantly greater than 0, ts > 7.5, indicating that both groups were overconfident. As expected, participants who used idea-unit standards were significantly less overconfident than were those who had no standard of evaluation, t(44) = 2.67, p = .01.

J. Dunlosky, K.A. Rawson / Learning and Instruction 22 (2012) 271–280

Scores conditionalized on self-score judgments of “partial credit” are not interpretable with respect to over/underconfidence, because the absolute value of performance that participants assign to “partial” is unknown. Nevertheless, for completeness, we present this conditional value as well: 12 (SE = 3) for the no-standard group and 30 (SE = 4) for the idea-unit group, t(44) = 3.74, p < .05.

2.2.3. Performance on the final practice trial

According to the AIM hypothesis, greater overconfidence was expected to influence final retention by limiting the initial level of objective learning during the practice session. The causal mechanism is straightforward but important: Overconfidence should lead to premature termination of practice (as indicated by lower levels of final achievement during practice). To evaluate this possibility, we computed the performance on the final trial during the practice session (immediately prior to a definition being dropped from practice) for each key-term definition. As expected, performance on the final trial of the practice session was higher for the idea-unit group (M = 66, SEM = 6) than for the no-standard group (M = 45, SEM = 5), t(44) = 2.80, p < .01. Moreover, when this final performance during the practice session was included as a covariate in the analysis of performance on the retention test, the effect of standard group on retention was no longer significant, F(1,43) < 1.0. Thus, overconfidence led to premature termination of practice and lower levels of initial performance, which persisted until the retention test. This evidence provides strong support for the causal mechanisms underlying the AIM hypothesis.

2.2.4. Study time

Although the algorithm used to terminate study was identical for all participants, they did have control over how long they studied each definition during practice trials. Thus, perhaps more study time was used by the idea-unit group than the no-standard group, which in itself might be a more proximal cause for the group differences in retention. To evaluate this possibility, for each participant we computed mean study times (in seconds) for the initial study trials and for all restudy trials during the practice session. Initial study time for one participant in the idea-unit group was removed as an outlier. Mean study times during the initial study trials did not significantly differ for the idea-unit group (20.4) and the no-standard group (18.0), t(43) = .69, nor did they differ on the restudy trials for the idea-unit (9.1) and no-standard (11.5) groups, t(44) = 1.20.

2.2.5. Trials to criterion

For each participant, we computed the number of trials that were required for all items to reach criterion during the practice session. Mean trials to criterion did not differ for the idea-unit group (30.0, SE = 1.7) and the no-standard group (31.5, SE = 2.3), t(44) = .63. (Given that each of the six items needed to be judged entirely correct three times before reaching criterion, participants in both groups had to make judgments of full credit on 18 trials; hence, they made judgments lower than full credit on 12.0 and 13.5 trials for the idea-unit and no-standard groups, respectively.) Most important, to reach the criterion of three responses judged as correctly recalled, students in both groups needed approximately five practice trials per definition.

2.3. Discussion

Retention of key-term definitions was 17% greater for the idea-unit group than for the no-standard group (a 52% relative improvement). This difference in retention was most closely linked to differences in overconfidence: the no-standard group judged incorrectly recalled definitions as correct much more often than did the idea-unit group. Moreover, the retention difference is not due to time-on-task differences, as the groups used similar amounts of time studying the definitions and the same number of trials to reach criterion. Rather, although both groups used the same amount of time trying to learn the definitions, it was the better judgment accuracy of the idea-unit group that ensured that more time was used practicing unknown items. Focusing on these items (and dropping those that had been accurately judged as correctly recalled) during the practice session yielded higher levels of performance at the end of practice, which in turn persisted until the retention test.

A limitation of the method used in Study 1 is that presenting idea-unit standards not only improved judgment accuracy, but it also changed the kind of feedback. In particular, after recalling a response, participants in the idea-unit group received feedback in terms of idea units and the presentation of the definition for restudy, whereas those in the no-standard group only received the latter kind of feedback. Given that the two forms of feedback for the idea-unit group were massed, it seems unlikely that any differences between the groups could account for the differences in long-term retention. Nonetheless, we address this issue in Study 2.

3. Study 2

In Study 2, we evaluated the AIM hypothesis using a method that sidestepped the main limitation of Study 1. We capitalized on the fact that substantial variability occurred in judgment accuracy even within the group who used idea-unit standards. Most relevant are values pertaining to overconfidence (Fig. 1, right-most bar). On average, when participants in the idea-unit group judged a response as correct, it was actually correct 57% of the time (which is better than the no-standard group but still reflects substantial overconfidence). The standard deviation was large (25.0) and the scores ranged from a low of 6% all the way to an impressive 99%. Consistent with the AIM hypothesis, these individual differences in overconfidence for the idea-unit group were significantly related to final retention (r = −.85; the comparable value for the no-standard group was −.72, both ps < .001). Of course, the small sample (n = 22) may yield a biased estimate and hence requires replication with a larger sample.

Study 2 was a larger scale investigation of individual differences in judgment accuracy and their relationship to retention. Because every participant made idea-unit judgments (as in the idea-unit group in Study 1), all participants received the same kind of feedback (with idea-unit feedback occurring immediately before restudy). Thus, if individual differences in judgment accuracy are related to retention, then Study 2 would provide converging support for the AIM hypothesis and would rule out differential feedback as an alternative explanation for the benefits of improved judgment accuracy on retention.

3.1. Methods

3.1.1. Participants and materials

One hundred and fifty-eight undergraduate students from Kent State University participated to receive course credit in Introductory Psychology. The materials were the same as in Study 1.

3.1.2. Procedure

The procedure was the same as for the idea-unit group in Study 1, with the following exceptions. First, in Study 1, it was necessary that participants in the idea-unit group make self-score judgments following their idea-unit judgments, so comparison of outcomes in the no-standard and idea-unit groups involved the same kind of judgment (as the no-standard group only made self-score judgments). Given that all participants in Study 2 made idea-unit judgments, collecting self-score judgments was unnecessary. Thus, participants in Study 2 were only prompted to make idea-unit judgments to streamline the practice session, and analyses of absolute accuracy were based on these judgments rather than on self-scores. After a participant indicated that they had recalled all the idea units for a given definition on three different trials, that item was removed from further practice trials.

Second, during restudy trials, some participants were shown the correct definition and were prompted to type in a paraphrase, whereas other participants were instructed to restudy the definition (as in Study 1). For trials in which a participant did not judge a response to be entirely correct, restudy took place immediately (as in Study 1). For trials in which a participant judged a response as correct (i.e., containing all idea units), the timing of restudy was manipulated (immediate, delayed, or none). Study 2 included these manipulations because another original goal of this study was to discover techniques to further boost retention. To our surprise, neither of these manipulations influenced judgment accuracy or retention, and when these manipulations were entered as covariates in the key analyses below, they did not change the results or conclusions. We can supply these null results to any interested reader, but given the lack of effects, we do not discuss them further.

3.2. Results

3.2.1. Retention as a function of judgment accuracy

Because Study 2 focused on individual differences in judgment accuracy, the analytic approach used here differed from the approach used in Study 1. To evaluate the relationship between judgment accuracy and final retention, we first computed a measure of judgment accuracy during the practice session. For each participant, we computed the actual percentage of correct idea units contained in the responses made on the trials in which the participant had judged a response as completely correct (i.e., the participant had checked “yes” to every idea unit when making the idea-unit judgment). We then subtracted this value from 100 so that higher values (i.e., higher percent of incorrect recall) indicated more overconfidence. Any values greater than 0 indicate overconfidence, because each participant had judged these responses as completely correct. Similar to our approach in Study 1, we used this measure of overconfidence because judging that all idea units are present in a response is unambiguous: in this case, participants are claiming that the response is entirely accurate, hence any score lower than 100% must indicate overconfidence. As important, this measure of overconfidence was most relevant to the control decision to drop a definition from study when a participant had judged it as entirely accurate three times.
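As a minimal sketch of this overconfidence measure and the grouping used below (the per-trial idea-unit scores here are hypothetical; only group-level results are reported in the text):

```python
# Hypothetical actual % of idea units correct, restricted to trials the
# participant judged as completely correct (all idea units checked "yes").
judged_fully_correct = [100, 50, 75, 25]

# Overconfidence = 100 minus mean actual accuracy on these trials; any value
# above 0 indicates overconfidence, since every trial was judged fully correct.
overconfidence = 100 - sum(judged_fully_correct) / len(judged_fully_correct)

# The five overconfidence bins used to group participants, as (low, high) bounds.
bins = [(50, 100), (30, 49), (20, 29), (10, 19), (0, 9)]
group = next(b for b in bins if b[0] <= overconfidence <= b[1])  # (30, 49)
```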

According to their level of overconfidence, we separated participants into five groups: percent incorrect recall between 100 and 50% (n = 18), 49–30% (n = 23), 29–20% (n = 27), 19–10% (n = 38), and 9–0% (n = 52).¹ Participants in the 100–50% group showed the highest levels of overconfidence, because when they judged their responses as correct their responses were actually incorrect on over 50 percent of the trials. By contrast, participants in the 9–0% group showed excellent judgment accuracy. Also, note that overall, these college students were relatively accurate (the majority of participants had overconfidence scores below 19%), but considerable variability did occur, which was essential for evaluating the relationship between absolute accuracy and retention.

¹ For completeness, we report two outcomes for trials in which participants did not judge responses as entirely correct: (a) the mean percentage of idea units that participants judged as being present, and (b) the percentage scored as correct. Results for the five groups of participants (from highest to lowest overconfidence) were 48.7% idea units judged as correct and only 27.5 actually correct (for the 100–50% overconfidence group), 53.1 and 37.1 (49–30% group), 50.0 and 41.4 (29–20% group), 54.2 and 55.0 (19–10% group), and 55.0 and 64.6 (9–0% group), all SEMs < 5.0. These outcomes indicate that participants who were most overconfident when judging their responses as entirely correct were also more overconfident when judging their responses as partially correct, F(4,152) = 31.1, MSE = .01, for the interaction between measure (judged vs. actual correct) and overconfidence levels.

To examine this relationship, we present final test performance in Fig. 2 for each of the five accuracy groups (M across all participants = 71%, SD = 23, range = 0–100), which reveals a highly systematic outcome. Final retention of the definitions increased monotonically as participants’ overconfidence during the practice session decreased. Participants who were most overconfident (100–50% group) retained fewer than 30% of the definitions, whereas those who showed little overconfidence retained nearly all of the definitions that they had practiced. This relationship between overconfidence during practice and final retention was statistically significant, F(4,153) = 52.50, MSE = 230.29 (with overconfidence group as a 5-level quasi-between-subject variable), confirming the AIM hypothesis.

One explanation for the relationship between overconfidence and retention (Fig. 2) is that participants with the higher levels of overconfidence (e.g., the 100–50% and 49–30% groups) are less skilled at (or less capable of) learning key-term definitions. Thus, according to this explanation, their lack of ability is responsible for both their overconfidence and lower levels of final retention. We did not obtain an independent measure of learning ability; nevertheless, performance on the first test trial of the practice session does provide an index of how capable each participant was at learning these definitions (i.e., prior to participants receiving further restudy trials to reach criterion). If differences in learning ability are responsible, then group differences on the initial test trial during the practice session should account for the group differences in final retention (Fig. 2). This hypothesis was not supported. Consider mean recall performance on the first test trial of the practice session as a function of overconfidence group: 11.5, SEM = 2.2 (100–50%), 18.5, SEM = 2.0 (49–30%), 21.8, SEM = 2.0 (29–20%), 28.2, SEM = 1.6 (19–10%), and 31.7, SEM = 1.3 (9–0%). Although performance on this initial test does significantly increase as overconfidence decreases, F(4,153) = 21.07, MSE = .9, the overall differences are small in magnitude compared to the group differences in final retention (from 32 to 89% for the highest to lowest overconfidence, respectively; from Fig. 2). Moreover, when performance on this initial test during the practice session is used as a covariate in the analysis of overconfidence group and final retention, the differences in the adjusted means (from 30 to 86% for the highest to lowest overconfidence) are significant, F(4,152) = 23.97, MSE = 2.0, and just as large as the differences in the unadjusted means (from Fig. 2). This evidence suggests that individual differences in initial learning ability do not account for the relationship between overconfidence and final retention.

Fig. 2. Performance on the final retention test as a function of overconfidence. Participants in the far left group (most overconfident) judged responses as completely correct that were actually incorrect on 50% of the trials or more, whereas those participants on the far right (least overconfident) showed almost perfect absolute accuracy. Error bars are the corresponding standard error of the mean.

3.2.2. Correlational analysis of judgment accuracy, final performance during the practice session, and final retention

According to the AIM hypothesis, greater overconfidence was expected to influence final retention by limiting the initial level of learning during the practice session. To evaluate this prediction, we first examined the zero-order correlations among the three factors: (a) the Pearson r correlation between judgment accuracy (overconfidence) and final retention was −.75, p < .001, which is consistent with the outcomes presented in Fig. 2; (b) the correlation between overconfidence and performance during the final trial of the practice session (i.e., the number of definitions correctly recalled during the final recall trial before reaching criterion) was −.93, p < .001; and (c) the correlation between final performance during the practice session and final retention was .71, p < .001. The more fine-grained prediction is that individual differences in final performance during practice will mediate the relation between overconfidence and final retention. To evaluate this prediction, we computed the correlation between judgment accuracy and retention partialled on final performance during study. The partial correlation (−.34, p < .001) was significantly less than zero but also substantially weaker than the corresponding zero-order correlation between judgment accuracy and final retention (−.75), z = 5.4, p < .05. Thus, as predicted by the AIM hypothesis, overconfidence was directly related to lower performance during the practice session, and these differences persisted through final retention.
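As a check on this mediation logic, the partial correlation can be recovered from the three zero-order correlations via the standard first-order partial correlation formula; the sketch below plugs in the rounded values from the text, so it reproduces the reported −.34 only approximately.

```python
import math

def partial_corr(r_xy, r_xz, r_zy):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_zy) / math.sqrt((1 - r_xz**2) * (1 - r_zy**2))

r_oc_ret = -0.75    # (a) overconfidence vs. final retention
r_oc_final = -0.93  # (b) overconfidence vs. final practice performance
r_final_ret = 0.71  # (c) final practice performance vs. final retention

# Controlling for final practice performance shrinks the accuracy-retention
# correlation from -.75 toward zero, consistent with partial mediation.
r_partial = partial_corr(r_oc_ret, r_oc_final, r_final_ret)  # ~ -0.35
```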

3.2.3. Study time

One explanation for the impressive relationship between judgment accuracy and final retention is that the participants who were most overconfident were not motivated and spent less time studying, whereas those who were more accurate tended to be more conscientious and spent more time studying. To evaluate these possibilities, we computed initial study time and restudy times for each participant (for initial study times, values from three participants in the 9–0% group were more than 6 SDs above the mean and were excluded from this analysis).² As shown in Fig. 3, the mean time used during the initial study phase (left panel) and during the subsequent restudy trials (right panel) did not systematically differ as a function of accuracy group. Consistent with these observations, separate one-way ANOVAs revealed no effect of group for initial study time, F(4,150) = 1.27, MSE = 265.7, or for restudy time, F(4,153) = 1.99, MSE = 585.1.

3.2.4. Trials to criterion

For each participant, we computed the number of trials used to reach the criterion. Mean trials to criterion did not differ among the five groups, from least (100–50%) to most (9–0%) accurate: 34 (SE = 3), 33 (SE = 3), 31 (SE = 2), 31 (SE = 2), 33 (SE = 1), respectively, F < 1. Also, given that participants judged responses as entirely correct on 18 trials, the number of trials in which participants judged responses as not entirely correct (the values above minus 18) also does not significantly differ across groups.

² Excluding these participants’ data from analysis of retention did not influence the overall pattern shown in Fig. 2. Final performance in this group was still 88% and significantly higher than recall performance for the next best group (19–10%), t(85) = 4.42.

3.3. Discussion

The AIM hypothesis received converging evidence from this investigation of individual differences in judgment accuracy. College students who more accurately judged when they had correctly recalled entire responses (as measured by the idea-unit judgments) retained more definitions. This relationship was highly consistent, with retention increasing monotonically with decreasing overconfidence (Fig. 2). As in Study 1, it cannot be explained by differential time on task, because all groups used a similar amount of study time and a similar number of trials to reach criterion. Moreover, given that all participants received idea-unit feedback, the additional feedback received by the idea-unit group in Study 1 no longer appears to be a viable explanation for the benefits of increased judgment accuracy on long-term retention.

4. General discussion

4.1. Absolute accuracy of students’ judgments is important for effective learning

The AIM hypothesis was supported by converging evidence across two studies, one of which used an experimental approach and the other an individual differences approach. Both studies demonstrated a strong relationship between absolute judgment accuracy and long-term retention. The bottom line is that judgment accuracy matters a great deal for effective learning and durable retention: overconfidence led to the premature termination of studying some definitions (as indicated by lower levels of final performance during the practice session) and to lower levels of retention. The studies also rule out a number of less interesting explanations for the relationship between judgment accuracy and retention. First, these studies were designed to meet two auxiliary assumptions of the AIM hypothesis: participants’ judgments were used to allocate study decisions in a relatively effective manner (Rawson & Dunlosky, 2011) and in an identical manner across participants. Thus, any individual differences that might otherwise have arisen in how well participants made these control decisions (e.g., Thiede, 1999) cannot account for the present findings. Second, participants did appear to comply with instructions to do their best and were attempting to learn the definitions, in that regardless of their level of overconfidence, they spent a non-trivial amount of time studying and restudying each of the definitions (rather than quickly skimming or bypassing restudy). These outcomes indicate that motivational factors cannot entirely explain the accuracy-retention link in the present studies. Third, and perhaps most impressive, students who made more accurate judgments did not merely use more time or more trials to reach criterion. Instead, the link between overconfidence and retention was partly mediated by final performance during the practice session (i.e., when judgments were used to control study). The fact that performance at the end of the practice session was better for students who were less overconfident further highlights the importance of making accurate judgments during practice and also confirms the AIM hypothesis. Namely, students’ overconfidence led to poorer learning during the practice session because the definitions that students believed were well learned (but were not) did not reach an objective learning criterion during the practice session; by fiat, they were prematurely dropped from practice. Performance at the end of the practice session was highly related to retention, but it was the individual differences in judgment accuracy (and in particular, overconfidence) that were responsible for this relationship and ultimately yielded different levels of retention. The evidence presented here highlights the scope of the problem, because final performance during the practice session mediated nearly all of the relationship between overconfidence and retention. Overconfidence while learning poses a major threat to student learning and achievement.

Fig. 3. Mean study times as a function of overconfidence. Lower values on the x-axis indicate more overconfidence. See text for details. Error bars are the corresponding standard error of the mean.

By focusing on absolute accuracy while learning complex definitions, these studies also offer a unique contribution to a small but growing literature on the accuracy-memory link. In reviewing this literature, it is evident that accurate monitoring is a powerful tool that shows much promise for improving student learning and retention. Nevertheless, many questions must be answered before any comprehensive recommendation should be made: When will training students to use accurate monitoring produce the largest gains in terms of learning efficiency and durability? In the present studies, participants had to learn only six definitions, yet for many introductory courses, many more key-term definitions must be acquired for each new chapter in a textbook. When students need to learn many key-term definitions, time may be further limited, making high levels of judgment accuracy even more essential to promote efficient learning. Other questions include: Which students will benefit most from training on how to use accurate monitoring to control learning? And, at what age should students be taught monitoring skills and how to use them? This list is not exhaustive, but it does begin to illustrate the scope of the problem and provides a rich agenda for future research.

4.2. Improving judgment accuracy

Although insufficient evidence is currently available to support strong prescriptive conclusions about training students to improve their monitoring accuracy, many students already use self-testing to monitor their learning. In a survey of 472 college students, Kornell and Bjork (2007) reported that 68% used self-testing to monitor their learning. For students who already monitor their learning in this way, helping them to monitor more accurately will likely benefit them. Students should find it easy to accurately monitor their learning of some materials, such as translation equivalents (e.g., cheval – horse) from a foreign-language course. Simply testing after a delay (by covering up the sought-after response for each translation) can support highly accurate monitoring (e.g., Dunlosky & Nelson, 1992; Rhodes & Tauber, 2011).

By contrast, students of all ages appear to struggle with accurately evaluating how well they have learned or understood text materials (for general reviews, see Dunlosky & Lipko, 2007; Thiede, Griffin, Wiley, & Redford, 2009). To illustrate, consider how well college students evaluate their learning of classroom concepts like those used in the present studies. When college students are presented a key term and are asked to judge how well they have learned the definition (using a self-score judgment, as in the present Study 1), they do not always explicitly attempt to retrieve the definition (Dunlosky, Rawson, & Middleton, 2005). Even when college students are overtly prompted to retrieve a definition (as in both of the present studies), they are still overconfident unless they receive an appropriate standard of evaluation to use when making their judgments. Put differently, when students are left to their own devices, many of them use ineffective methods to monitor their learning, which can produce overconfidence and underachievement. This problem is also likely promoted by textbooks that include end-of-chapter lists of key terms, which may encourage students to use term familiarity as a basis for their judgments. Students who actually attempt to retrieve the definitions would then face the added difficulty of needing to work back through the chapter to locate the correct definition to compare to their response. And even for students who do compare their answer to the correct definition, their self-score judgments would still show considerable overconfidence. For instance, Rawson and Dunlosky (2007) had college students study key-term definitions, attempt to recall them, and then make self-score judgments by comparing their answer to the correct definition (referred to as a full-definition standard). Although the full-definition standards somewhat reduced overconfidence as compared to using no standard, college students who used the full-definition standards still judged their commission errors as fully or partially correct 43% of the time.

Thus, many students use self-testing to monitor their learning, but in some (and perhaps many) learning contexts, self-testing alone will not ensure high levels of judgment accuracy and can even produce substantial overconfidence. How can students sidestep their metacognitive illusions and become more accurate judges of their learning and comprehension? The answer to this question will likely depend on many factors, including the materials students are learning, how they study them, and how they will be tested (e.g., Griffin, Wiley, & Thiede, 2008; Thiede et al., 2009; Thomas & McDaniel, 2007). For the key-term definitions used in the current study (which are foundational to many introductory-level courses), idea-unit judgments offer a partial solution in that they further reduce overconfidence and can support effective learning when made accurately. Textbooks could easily adopt this metacognitive technology by including an appendix with key-term definitions broken into their basic idea units. Alternatively, students could develop their own idea-unit standards for use when evaluating their responses (cf. Dunlosky et al., 2011).

Despite the promise of idea-unit standards for supporting students’ learning of key-term definitions, two caveats are relevant to their application. First, idea-unit judgments were not perfectly accurate in the present studies or in previous ones (Dunlosky et al., 2011; Lipko et al., 2009), so developing techniques to further improve their accuracy is an important activity for educational research. Second, idea-unit standards may not be as useful when students evaluate their learning of lengthier or more complex materials, such as paragraphs, sections, or chapters of textbooks. Perhaps longer texts could be distilled to main points and important supporting ideas, so that students can compare their understanding of the text to the main ideas. For instance, a student could write a brief summary that captures their understanding, and then compare the summary to the main ideas within the text. We leave pursuit of such possibilities for future research (for other promising ones, see Thiede et al., 2009).

4.3. Summary

Various outcomes from the present studies confirmed the AIM hypothesis and more generally demonstrated the promise of a metacognitive approach to improving student scholarship. Certainly, many factors other than metacognitive ones will contribute to student success, but individual students may be able to improve their learning and retention if they are trained to accurately monitor their learning and use it to properly control subsequent practice. Given the undeniable promise of this approach that has been established in laboratories, conducting controlled studies within classroom settings represents an exciting research agenda that will be necessary for establishing its power to enhance student achievement.

Acknowledgments

Please send correspondence to John Dunlosky, Kent State University, Psychology Department, Kent, OH, 44242; e-mail: [email protected]. Thanks to Marissa Hartwig for comments on an earlier version of this paper. This research was supported by the James S. McDonnell Foundation 21st Century Science Initiative in Bridging Brain, Mind and Behavior Collaborative Award, and by the Institute of Education Sciences, U.S. Department of Education, through Grants #R305H050038 and #R305A080316 to Kent State University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

J. Dunlosky, K.A. Rawson / Learning and Instruction 22 (2012) 271–280