
Predicting Performance: A Comparison of University Supervisors' Predictions and Teacher Candidates' Scores on a Teaching Performance Assessment

Judith Haymore Sandholtz1 and Lauren M. Shea1

1University of California, Irvine, USA

Journal of Teacher Education, 63(1), 39-50
© 2012 American Association of Colleges for Teacher Education
Reprints and permission: http://www.sagepub.com/journalsPermissions.nav
DOI: 10.1177/0022487111421175
http://jte.sagepub.com

Corresponding Author: Judith Haymore Sandholtz, Department of Education, University of California, Irvine, 3200 Education Building, CA 92697-5500, USA. Email: [email protected]

Abstract
The implementation of teaching performance assessments has prompted a range of concerns. Some educators question whether these assessments provide information beyond what university supervisors gain through their formative evaluations and classroom observations of candidates. This research examines the relationship between supervisors' predictions and candidates' performance on a summative assessment based on a capstone teaching event, the Performance Assessment for California Teachers. The study, based on records for 337 teacher candidates over a 2-year period, specifically addresses the following questions: To what extent do university supervisors predict candidates' total scores? On which questions and categories of the assessment do supervisors most accurately predict their candidates' scores? Do supervisors predict scores more accurately for high- and low-performing candidates? The findings indicate that university supervisors' perspectives did not always correspond with outcomes on the performance assessment, particularly for high and low performers.

Keywords
preservice education, assessment, supervision, teaching performance assessment

The assessment of teaching practice continues to be a significant issue for teacher education programs. Federal legislation requires that graduates' performance on licensing tests be included in evaluations of schools of education (Darling-Hammond, 2006), and legislation in California requires that teacher certification programs implement a performance assessment to evaluate candidates' mastery of specified teaching performance expectations (California Commission on Teacher Credentialing [CCTC], 2006). The implementation of performance assessments has prompted a range of concerns. Some concerns center on the difficulty of defining teaching and the reliability of performance assessments, but teacher educators also argue that the assessments limit the richness of their programs and harm the nature of relationships essential for learning (Snyder, 2009). A key concern across programs is the cost, which, combined with a lack of funding (Guaglianone, Payne, Kinsey, & Chiero, 2009; Porter, Youngs, & Odden, 2001), leads teacher educators to question whether resources could be better spent in other ways (Snyder, 2009). Some question whether performance assessments provide information beyond what university supervisors gain through their formative evaluations and classroom observations of candidates. Our aim in this research is to explore the extent to which supervisors' perspectives about candidates' performance correspond with outcomes from summative performance assessments. The study specifically examines the relationship between university supervisors' predictions and teacher candidates' performance on a summative assessment based on a capstone teaching event, part of the Performance Assessment for California Teachers (PACT). The study addresses the following questions: (a) To what extent do university supervisors accurately predict candidates' total scores on a performance-based teaching assessment? (b) On which questions and categories of the assessment do university supervisors most accurately predict their candidates' scores? and (c) Do university supervisors predict scores more accurately for high- and low-performing candidates?

Theoretical Framework

The theoretical framework for this study draws from research establishing the complex nature of teaching and, consequently, the challenges of assessing teaching practices. In contrast to process–product research in which effective teaching could be attributed to discrete, observable teaching performances operating independent of time and place (Shulman, 1986), conceptions of effective teaching now recognize the complex, changing situations and often competing demands that teachers face (Darling-Hammond & Sykes, 1999; National Board for Professional Teaching Standards [NBPTS], 1999; Richardson & Placier, 2001). The core activities of teaching occur in real time, involve social and intellectual interactions, and are shaped by the students in the environment, thus increasing the complexity of the task (Leinhardt, 2001). The racial, cultural, social, and linguistic backgrounds of students shape the context in ways that require teachers to adopt a more expansive view of pedagogy. For students to experience equal educational opportunities, teachers need to conceptualize learning as a cultural process (Lee, 2007). Rather than a singular focus on student achievement, culturally relevant teachers take a more holistic approach that considers issues of moral, ethical, and personal development (Ladson-Billings, 1994, 1995). In these complex contexts, teachers must exercise professional judgment in making decisions, and their decisions are inextricably linked to the specific content and the particular students being taught. The unique, often problematic, situations that arise preclude formulaic solutions (NBPTS, 1999).

Teachers draw on specialized expertise in making decisions about their work. Expertise, considered to be applied formal knowledge (Brint, 1994), is a defining characteristic of professions and a foundation for professional judgment. The knowledge base for teachers extends beyond subject matter knowledge to include, for example, knowledge of educational aims, learners, curriculum, general pedagogy, and subject-specific pedagogy (Munby, Russell, & Martin, 2001; Shulman, 1987). Teachers apply their professional knowledge to decide what and how to teach to promote student learning. When teaching is viewed as more than the simple transmission of facts and ideas, the need for professional judgment and autonomy becomes clear. Across professions, autonomy and freedom of action are necessary conditions for professionals to adapt their service to particular client needs and circumstances (Brint, 1994; Friedson, 2001). In school settings, teachers must adapt their teaching to meet the diverse and changing needs of students in their classrooms. Variability of context, combined with the complexity of teaching, has shifted the view of the teacher to “a thinking, decision-making, reflective, and autonomous professional” (Richardson & Placier, 2001).

These changes in conceptions of effective teaching prompted dissatisfaction with traditional measures and led to increased attention to teacher assessments that acknowledge progressive, professional practices (Porter et al., 2001; Tellez, 1996). In contrast to bureaucratic forms of evaluation that suggest teachers' work is highly prescribed and rule-governed, a professional view recognizes that teachers analyze and adapt their practices (Darling-Hammond, 1986, 2001; NBPTS, 1999; Richardson & Placier, 2001). The expanded views of effective teaching coincided with increased calls for accountability for teacher preparation programs and the performance of candidates. Professional organizations developed standards based not only on what teachers needed to know but also on what they needed to be able to do. The standards became the basis for designing assessments that would determine whether a candidate achieved the criteria contained in the standards.

Systems of teacher assessment developed by professional organizations such as the Educational Testing Service (ETS), the Interstate New Teacher Assessment and Support Consortium (INTASC), and the NBPTS all feature performance-based assessments stemming from established standards. The aim is to replicate what candidates encounter in a real work situation and determine competence by judging their performance in the actual tasks and activities.

Researchers report advantages and disadvantages of using performance-based teaching assessments to determine the competence of preservice candidates. One advantage is that assessments are tied to professional teaching standards that reflect a high degree of consensus about what constitutes effective teaching (Arends, 2006a). In addition, depending on the validity, reliability, and fairness of assessment systems, claims about the quality of candidates have the potential to be based on credible data rather than on subjective impressions (Arends, 2006b). Another key benefit is that performance assessments, in contrast to traditional forms of evaluation, include evidence from teaching practice. Rather than a system that relies on completion of coursework and pencil-and-paper licensure examinations, performance assessments provide more direct evaluation of teaching ability (Mitchell, Robinson, Plake, & Knowles, 2001; Pecheone & Chung, 2006; Porter et al., 2001). Direct methods of assessment are better predictors of success in work settings than are indirect tests (Uhlenbeck, Verloop, & Beijaard, 2002). In teaching performance assessments, candidates perform tasks that stem directly from what teachers do in their classrooms. The focus shifts from determining a candidate's possession of knowledge and skills to determining the way in which a candidate uses his or her knowledge, skills, and dispositions in teaching and learning contexts (Darling-Hammond & Snyder, 2000).

In addition to serving an evaluative function, performance assessments can offer a professional learning opportunity for teacher candidates (Bunch, Aguirre, & Tellez, 2009; Darling-Hammond & Snyder, 2000). The ability to learn from one's own practice is considered an important component of effective teaching (NBPTS, 1999). After completing performance assessment tasks, candidates report gaining greater awareness of their own actions in the classroom as well as their students' behavior, which allows them to better plan their instructional strategies (Okhremtchouk et al., 2009). Performance assessments also have the potential to inform teacher preparation programs about areas of strength and weakness in preparing candidates, possibly leading to program improvement (Darling-Hammond, 2006; Pecheone & Chung, 2006). In addition to providing information for formative evaluations of teacher education programs, researchers propose that performance assessments also offer a means of evaluating the quality of teacher preparation programs for accreditation and accountability purposes (Pecheone & Chung, 2006).

The potential benefits of performance assessments are accompanied by a range of related concerns. When performance assessments are used to make summative or high-stakes decisions, the overarching challenge is ensuring validity, reliability, and fairness of the measures. Assessments of teaching performance must be tailored to particular disciplines or levels, and substantial training is needed to achieve and maintain interrater agreement (Arends, 2006a). When performance assessments are used for multiple functions such as credentialing decisions, accreditation, program improvement, and candidate learning, additional measurement challenges arise. The key issues are balancing competing demands and ensuring that one function does not dominate and lessen the value of the measure for the other functions (Snyder, 2009).

Given the demands of developing and implementing performance assessments, a key concern is the significant amount of financial and human resources required. In the long run, these costs may become increasingly burdensome for both programs and candidates. Candidates express concern that their university coursework and student teaching practices, as well as their personal lives, suffer due to the extensive time devoted to completing the performance assessment (Okhremtchouk et al., 2009). The labor-intensive programs may take limited resources away from other important functions in teacher education programs or may lead to superficial implementation (Zeichner, 2003). When accreditation of programs is at stake, there is a danger of “turning performance-based teacher education into a purely mechanical implementation activity that has lost sight of any moral purpose and of the need . . . to ask the hard questions about what is being accomplished and for whose benefit” (Zeichner, 2003, p. 502).

A related concern is that the emphasis in teacher education programs is shifting to alignment and compliance, thus limiting the way teaching is represented in the curriculum, inhibiting consideration of other perspectives, and avoiding issues related to values and philosophical choices (Delandshere & Arens, 2001; Kornfeld, Grady, Marker, & Ruddell, 2007). A study of the hidden curriculum of one performance-based teacher education program concluded that superficial demonstrations of compliance with external mandates became more important than authentic intellectual engagement (Rennert-Ariev, 2008). Although not an intended consequence, performance assessments have the potential to lead to curriculum reduction in teacher education programs (Arends, 2006a). The close link between standards and assessment systems creates a situation in which ideas not addressed in the standards are not included in evaluations of preservice candidates (Delandshere & Arens, 2001). For example, some researchers note that teaching standards and performance assessments do not adequately incorporate attributes and strategies associated with culturally relevant teaching (Ladson-Billings, 2000; Zeichner, 2003). In addition, performance assessments may exclude aspects of teaching that are important but not easily measured (Arends, 2006a).

PACT

Following legislation in 1998 that required teacher preparation programs to use standardized performance assessments in evaluating credential candidates, the CCTC contracted with the ETS to develop an instrument, known as the California Teacher Performance Assessment (CalTPA). Institutions could either adopt the state-developed model or develop alternate models and submit them for approval (CCTC, 2006). A key purpose of the performance assessments was to determine whether credential candidates had mastered the state's teaching performance expectations. A consortium, which was initially composed of 12 universities and has expanded to more than 30 institutions, opted to design an alternative performance assessment. The consortium wanted to develop “an integrated, authentic, and subject-specific assessment” that was “consistent with the core values of member institutions” (Pecheone & Chung, 2006, p. 22). The consortium's model, the PACT, was pilot tested over 5 years beginning in 2002. In 2007, the consortium published a technical report summarizing validity and reliability studies of the model (Pecheone & Chung, 2007), and the PACT was approved by the CCTC.

The PACT assessment is modeled after the portfolio assessments of the Connecticut State Department of Education, the INTASC, and the NBPTS. The assessment includes the use of artifacts from teaching and written commentaries in which the candidates describe their teaching context, analyze their classroom work, and explain the rationale for their actions. The PACT assessments focus on candidates' use of subject-specific pedagogy to promote student learning.

The PACT program includes two key components: (a) a formative assessment based on embedded signature assessments that are developed by local teacher education programs and (b) a summative assessment based on a capstone teaching event. Embedded signature assessments tend to reflect local program values and be embedded into one or more courses. According to the PACT website, examples of embedded signature assessments include a community study, an observation of classroom management, a child case study, or a curriculum unit. Programs use embedded signature assessments as additional requirements or as a course assignment, but they are not yet an approved form of assessment. In contrast to many classroom assignments, embedded signature assessments have formalized scoring criteria that are used by multiple instructors. The capstone teaching event is standardized across programs and involves subject-specific assessments of a candidate's competency in five areas or categories: planning, instruction, assessment, reflection, and academic language. Candidates plan and teach an instructional unit, or part of a unit, that is videotaped. Using the video, student work samples, and related artifacts for documentation, the candidates analyze their teaching and their students' learning. Following analytic prompts, the candidates describe and justify their decisions by explaining their reasoning and providing evidence to support their conclusions. The prompts help candidates consider how student learning is developed through instruction and how analysis of student learning informs teaching decisions during the act of teaching and upon reflection. The capstone teaching event is designed not only to measure but also to promote candidates' abilities to integrate their knowledge of content, students, and instructional context in making instructional decisions and to stimulate teacher reflection on practice (Pecheone & Chung, 2006).

The teaching events and the scoring rubrics align with the state's teaching standards for preservice teachers. The content-specific rubrics are organized according to two or three guiding questions under the five categories identified above. For example, the guiding questions for planning in elementary mathematics include the following: How do the plans support students' development of conceptual understanding, computational/procedural fluency, and mathematical reasoning skills? How do the plans make the curriculum accessible to the students in the class? and What opportunities do students have to demonstrate their understanding of the standards/objectives? For each guiding question, the rubric includes descriptions of performance for each of four levels. According to the implementation handbook (PACT Consortium, 2009), Level 1, the lowest level, is defined as not meeting performance standards. These candidates have some skill but need additional student teaching before they would be ready to be in charge of a classroom. Level 2 is considered an acceptable level of performance on the standards. These candidates are judged to have adequate knowledge and skills with the expectation that they will improve with more support and experience. Level 3 is defined as an advanced level of performance on the standards relative to most beginners. Candidates at this level are judged to have a solid foundation of knowledge and skills. Level 4 is considered to be an outstanding and rare level of performance for a beginning teacher and is reserved for stellar candidates. This level offers candidates a sense of what they should be aiming for as they continue to develop as teachers.

To prepare to assess the teaching events, scorers complete a 2-day training in which they learn how to apply the scoring rubrics. These sessions are conducted by lead trainers. Teacher education programs send an individual to be trained by PACT as a lead trainer, or institutions might collaborate to develop a number of lead trainers. The training emphasizes what is used as sources of evidence, how to match evidence to the rubric level descriptors, and the distinctions between the four levels. Scorers are instructed to assign a score based on a preponderance of evidence at a particular level. In addition to the rubric descriptions, the consortium developed a document that assists trainers and scorers in understanding the distinctions between levels. The document provides an expanded description for scoring levels for each guiding question and describes differences between adjacent score levels and the related evidence.

Method

Context

The data for this study were drawn from records for candidates in a public university's teacher education program over a 2-year period, 2007-2009. In this program, all of the university supervisors also acted as scorers for the performance assessment, although typically not for their own advisees. In keeping with the training outlined by the PACT Consortium, the supervisors at this university participated in 2 days of training each year. A lead trainer, who works in the teacher education program and had been trained by PACT for the role, conducted the sessions. During the training, the supervisors scored two or three benchmark teaching events, which were provided by the PACT Consortium. Each person read a specific section of the event, assigned a score, and compared their scores with the benchmark scores. The group then discussed any variations. Following the training, and before being allowed to score teaching events, the supervisors had to pass a calibration standard set by the PACT Consortium. Each supervisor's scores on the calibration teaching event (provided each year by PACT) had to meet three criteria: (a) they resulted in the same pass/fail decision, (b) they included at least six exact matches out of the 11 rubric scores, and (c) they did not include any scores that were 2 or more points away from the predetermined score. After completing training for PACT scoring and passing the calibration standards, university supervisors predicted scores for their own candidates and then received their assigned assessments to score. The training, calibrating, predicting, and scoring took place within a 2-week period.
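Because the calibration standard is a mechanical check, a short sketch may help make it concrete. The following is a minimal illustration in Python, not the authors' procedure or software; representing a teaching event as a plain list of 11 rubric scores is an assumption made only for this example.

    # Illustrative sketch (not the authors' code): the three PACT calibration
    # criteria applied to one scorer's 11 rubric scores for a teaching event.

    def passes_event(scores, max_level1=2):
        """An event fails when more than two questions are scored at Level 1."""
        return sum(1 for s in scores if s == 1) <= max_level1

    def meets_calibration_standard(scorer_scores, benchmark_scores):
        """Check the scorer's 11 scores against the benchmark scores."""
        assert len(scorer_scores) == len(benchmark_scores) == 11

        # (a) Same pass/fail decision as the benchmark.
        same_decision = passes_event(scorer_scores) == passes_event(benchmark_scores)

        # (b) At least 6 of the 11 rubric scores match the benchmark exactly.
        exact_matches = sum(s == b for s, b in zip(scorer_scores, benchmark_scores))

        # (c) No score differs from the benchmark by 2 or more points.
        within_one = all(abs(s - b) <= 1 for s, b in zip(scorer_scores, benchmark_scores))

        return same_decision and exact_matches >= 6 and within_one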

In this program, the university supervisors did not teach courses or seminars for student teachers, and they were not directly involved in preparing candidates for the performance assessment. The supervisors' role was to provide support and guidance for student teachers in their assigned classrooms. In keeping with this role, supervisors have substantial teaching backgrounds in a specific subject or at a particular level. Over the academic year, the supervisors made ongoing, periodic classroom visits to observe student teachers in the field. After each classroom observation, the supervisors talked with the student teacher and completed a written evaluation form that documented and followed up on the content of their discussion. By the time supervisors made predictions about their candidates' PACT scores, they had completed three classroom visits for each student teacher. Although university supervisors evaluate candidates' classroom teaching, their role focuses on formative assessment. They do not assign grades for the field experience component. The program coordinator (elementary or secondary) assigns the grades for student teaching based on supervisor assessments, mentor teacher evaluations, lesson plans, and other assignments.

Data Source

The pool included a total of 363 teacher candidates (156 multiple-subject/elementary education and 207 single-subject/secondary education). In 2007-2008, there were 152 candidates and 27 supervisors. In 2008-2009, there were 211 candidates and 32 supervisors. The records included scores on the PACT teaching event and predictive scores assigned by the university supervisor. We eliminated 26 records, 14 due to missing predictive scores and 12 because both the teaching event scores and predictive scores were assigned by the same individual. The analysis included data from 337 candidates.

The predictions and the teaching event scores included a ranking from one to four on each of 11 guiding questions that are grouped within the five categories. Therefore, the possible total score ranged from 11 to 44. Table 1 summarizes the focus of the guiding questions within each category at the time of data collection. As described above, the rankings are defined as follows: Level 1, not meeting performance standards; Level 2, acceptable level of performance; Level 3, advanced level of performance relative to most beginners; and Level 4, outstanding and rare level of performance for a beginning teacher (PACT Consortium, 2009).

Data Analysis

We used aggregate data of predictions and scores from the 337 candidates. Paired-samples correlations and frequency distributions of the differences between predictions and scores were computed using the statistical software program SPSS. To assess the association between predictions and scores, a paired-samples correlation was estimated for the total score, each of the five categories, and each of the 11 guiding questions. A correlation of 1.0 would indicate that supervisors predicted their candidates' performance on the PACT assessment exactly. To determine the percentage of supervisors who did not predict their candidates' performance, we used a frequency distribution of the differences. This analysis was completed by determining the difference between the prediction and the teaching event score for each candidate's total score, each of the five categories, and each of the 11 questions. A difference of zero would indicate that the supervisor exactly predicted the candidate's score.
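As an illustration of these two analyses, the sketch below uses Python rather than SPSS; the file name and column names (pred_total, score_total) are hypothetical stand-ins for the study's records, which are not available here.

    # Minimal sketch of the two analyses described above (hypothetical data layout).
    import pandas as pd
    from scipy.stats import pearsonr

    df = pd.read_csv("predictions_and_scores.csv")  # hypothetical file: one row per candidate

    # Paired-samples correlation for the total score; a value of 1.0 would mean
    # supervisors predicted total scores exactly.
    r, p = pearsonr(df["pred_total"], df["score_total"])
    print(f"total score: r = {r:.3f}, p = {p:.3f}")

    # Frequency distribution of prediction-score differences; a difference of
    # zero indicates an exact prediction of the total score.
    diff = (df["pred_total"] - df["score_total"]).abs()
    bins = pd.cut(diff, bins=[-1, 0, 5, 10, 15, 20, 44],
                  labels=["0", "1-5", "6-10", "11-15", "16-20", "21+"])
    print(bins.value_counts().sort_index())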

To compare the total score differences using a standard that does not expect an exact match, we used the PACT training calibration standard. In this analysis, we disaggregated the data and examined the predictions and scores for each candidate to determine accuracy according to the three conditions of the calibration standard: (a) the same pass/fail designation, (b) at least 6 (of 11) exact matches, and (c) all nonmatches within 1 point. We determined pass/fail designation by the number of Level 1 scores for individual questions, which according to the established PACT passing standard must be no more than two. We then calculated the number of exact matches and the number of nonmatches that differed by two or more points. If the prediction met all three conditions of the calibration standard, it was considered accurate for our analyses. Table 2 provides examples of cases that differ from each condition yet fall within the zero-to-five accuracy range from our first level of analysis.
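A brief continuation of the earlier sketches shows how the calibration standard could serve as an accuracy criterion across candidates; the per-question column names (pred_q1 ... pred_q11, score_q1 ... score_q11) are again hypothetical, and the function and data frame come from the sketches above.

    # Continuing the sketch: classify each prediction as accurate or not by
    # applying meets_calibration_standard() to the 11 question-level columns.
    pred_cols = [f"pred_q{i}" for i in range(1, 12)]
    score_cols = [f"score_q{i}" for i in range(1, 12)]

    df["accurate"] = [
        meets_calibration_standard(list(row[pred_cols]), list(row[score_cols]))
        for _, row in df.iterrows()
    ]
    print(f"{df['accurate'].mean():.1%} of predictions meet the calibration standard")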

To examine predictions for high-performing and low-performing candidates, we placed candidates who scored 37 points and above (out of a total of 44 points) into a high-performer category (n = 22) and candidates who scored 20 points or below into a low-performer category (n = 21). We chose scores of 37 and 20 as cutoff points for two reasons. First, both scores, 37 and 20, fell at the end of the second standard deviation (M = 27.75, SD = 5.521) of the total scores. Second, the score meant that the candidate received a ranking on at least one question that was at the end of the rubric scale. That is, the high performers received one or more rankings of four, and the low performers received one or more rankings of one. We completed the same analyses on these subsamples to determine whether supervisors more accurately predicted their performance.
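For completeness, the subsample assignment under the stated cutoffs is simple to express; this continues the hypothetical data frame from the sketches above and implements only the cutoffs reported in the text, not their derivation.

    # Continuing the sketch: subsample assignment using the stated cutoffs
    # (total score >= 37 = high performer, <= 20 = low performer; candidates
    # in between belong to neither subsample).
    high_performers = df[df["score_total"] >= 37]
    low_performers = df[df["score_total"] <= 20]
    print(len(high_performers), "high performers;", len(low_performers), "low performers")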

Findings

In the following sections, we present the results for each research question and summarize the ranges, spreads, and accuracies. We examine predictions for total scores, predictions on individual questions and categories, and predictions for high- and low-performing candidates. We then discuss the findings and consider implications for teacher education programs.

Table 1. Focus of Guiding Questions in PACT Rubrics

Category            Focus of guiding questions
Planning            Establishing a balanced instructional focus; Making content accessible; Designing assessments
Instruction         Engaging students in learning; Monitoring student learning during instruction
Assessment          Analyzing student work from an assessment; Using assessment to inform teaching
Reflection          Monitoring student progress; Reflecting on learning
Academic language   Understanding language demands; Supporting academic language

Note: PACT = Performance Assessment for California Teachers. An additional question on assessment was added in 2009-2010.


Research Question 1: Total Score Predictions

The correlation between predictions and total scores for candidates over the 2-year period was .289. Table 3 shows the distribution of difference for scores in 5-point spreads and the percentage of predictions in those spreads that met the training calibration standard for accuracy.

Of the 337 predictions, 22 (6.5%) matched the total scores, and 194 of the predictions (57%) were within 1 to 5 points of the total score. However, not all of these predictions would be considered accurate according to the PACT calibration standard. Of the 22 cases in which the predicted and total scores matched, all had the same pass/fail designation. But 4 cases did not include at least six exact matches (Condition b) and 1 case had nonmatches greater than 1 point as well as fewer than six exact matches (Conditions c and b). Of the 194 cases in which the prediction and the total scores differed from 1 to 5 points, 65 cases did not meet the calibration standard. Of these 65, 6 did not have the same pass/fail designation, 52 did not include at least six exact matches, and 19 had nonmatches greater than 1 point. Twelve cases failed to meet two of the three conditions. Consequently, of the 216 cases in which predictions and total scores were within 0 to 5 points of each other, 146 (67.6%) met all three conditions of the calibration standard and 70 (32.4%) did not. There were no cases in which the prediction and the score matched exactly for each of the 11 questions. However, there were 8 cases in which the predictions and scores matched for 10 of the 11 questions.

When we use the calibration standard as a measure of accuracy, 43.2% of the total 337 predictions are within the accurate range. Predictions that did not meet the calibration standard for accuracy (n = 191) were not overwhelmingly in one direction but instead split between over- and underpredictions. Of the 191 cases, 57.6% were underpredictions and 39.8% were overpredictions. In 2.5% of these cases, the prediction and total score matched, but, as described above, other conditions were not met.

Results for tested subgroups (by year and program type) followed a similar trend of low correlations between predictions and scores. For the 2007-2008 and the 2008-2009 subgroups, the total score correlations were .244 and .320, respectively. The total score correlations by program type were .252 for the elementary education group and .347 for the secondary education group. All of the correlations were significant at the .01 level.

Research Question 2: Individual Question and Category Predictions

Table 4 shows the correlations, percentage of accurate predictions, and the range of inaccurate predictions for each of the 11 guiding questions. Correlations between predictions and scores on the questions ranged from .110 to .242. The accurate predictions ranged from 41.5% on Question 6 (analyzing student work from an assessment) to 54.3% on Question 10 (understanding language demands).

Table 2. Examples of Cases That Differ from the Standard

Example: Prediction and score for each question (Q1-Q11) and total.

Criterion: Must have same pass/fail designation
  Reason: Prediction of passing, but candidate failed
    Prediction:  2  2  2  2  2   2   2   2  2   2   2   (Total 22)
    Score:       2  2  2  2  1a  1a  1a  2  1a  1a  2   (Total 17)
  Reason: Prediction of failing, but candidate passed
    Prediction:  2  1a  2  2  2  2  1a  2  2  1a  2   (Total 19)
    Score:       3  2   2  3  2  2  2   2  2  2   2   (Total 24)

Criterion: Must have six (of 11) exact matches
  Reason: Fewer than six exact matches
    Prediction:  2  2  2  3  2a  2a  2  2a  2a  2a  3   (Total 24)
    Score:       3  3  3  2  2a  2a  1  2a  2a  2a  2   (Total 24)

Criterion: All nonmatches must be within 1 point
  Reason: Nonmatch greater than 1 point
    Prediction:  3  4  3  3  3  2  1a  3  2  3  3   (Total 30)
    Score:       3  3  3  3  2  3  3a  3  3  2  2   (Total 30)

Note: PACT = Performance Assessment for California Teachers. A fail designation results from more than two scores of Level 1.
a. Denotes the reason the criterion was not met.

Table 3. Distribution of Difference for Total Score

Difference between predictions and scores   Frequency   Percentage   Percentage that met calibration standard
0                                               22          6.5                     5.0
1-5                                            194         57.0                    38.2
6-10                                            95         28.1                     0
11-15                                           20          5.9                     0
16-20                                            5          1.8                     0
21 and above                                     1          0.3                     0

Note: Total analyses completed with 337 candidates.


Questions 1, 2, 3, and 6 had one prediction spanning the full range of the rubric, meaning a difference of 3 points between the prediction and the score. In all cases, these 3-point differences were overpredictions, suggesting the university supervisor predicted exemplary performance, but the scorer evaluated the assessment on that question as not meeting standards.

Table 5 shows the correlations, percentage of accurate predictions, and the range of inaccurate predictions for each of the five categories. As summarized earlier in Table 1, the 11 guiding questions are grouped into five categories: planning, instruction, assessment, reflection, and academic language. Correlations between predictions and scores on the categories ranged from .113 to .269. The predictions for the categories had lower accuracy than predictions for individual questions. The accurate predictions ranged from 17.5% (planning) to 38% (academic language).

Research Question 3: Predictions for High-Performing and Low-Performing Candidates

As with the total group of candidates, the analysis of data for the high- and low-performing groups focused on the accuracy of predictions for total scores, individual questions, and categories. For the high-performing candidates (total PACT score 37 or above) and the low-performing candidates (total PACT score 20 or below), the university supervisors were no more likely to accurately predict their scores than the other candidates' scores. The percentage of accurate predictions and the correlations between predictions and scores were low.

High-performing candidates. For the 22 high performers, correlations between predictions and scores ranged from .056 to .276 for the categories and from –.364 to .404 for the individual questions (see Table 6). None of the correlations were statistically significant at the .05 level. The frequency of accurate predictions for the individual questions ranged from one accurate prediction (4.5%) for Question 6 (analyzing student work from an assessment) to nine accurate predictions (40.9%) for Question 4 (engaging students in learning). The frequency of accurate predictions for the categories, consisting of two or three questions, ranged from one accurate prediction (4.5%) for Assessment to six accurate predictions (27.3%) for Reflection.

Predictions for only 5 of the 22 high-performing candidates would be considered accurate according to the training calibration standard. In 1 of those 5 cases, the prediction and the total score matched, with nine of the 11 individual question predictions matching the scores. For all 22 high performers, the supervisors accurately predicted that the candidate would pass the assessment. However, in the 17 cases that did not meet the calibration standard, 13 had fewer than six exact matches on the individual questions and also had nonmatches greater than 1 point. Four predictions did not meet the calibration standard due to a mismatch on one of the three conditions. In all 17 cases, the university supervisors underpredicted total scores for the high performers.

Table 4. Correlations, Percentage Accurate Predictions, and Range for Predictions and Scores

Question                                                           Correlation between      Percentage of          Range of difference:
                                                                   predictions and scores   accurate predictions   predictions and scores
Q1  Planning: Establishing a balanced instructional focus                 .149***                  51.0                    3
Q2  Planning: Making content accessible                                   .201***                  46.9                    3
Q3  Planning: Designing assessments                                       .136***                  45.1                    3
Q4  Instruction: Engaging students in learning                            .242***                  50.1                    2
Q5  Instruction: Monitoring student learning during instruction           .225***                  49.0                    2
Q6  Assessment: Analyzing student work from an assessment                 .090*                    41.5                    3
Q7  Assessment: Using assessment to inform teaching                       .110**                   52.5                    2
Q8  Reflection: Monitoring student progress                               .143***                  48.4                    2
Q9  Reflection: Reflecting on learning                                    .228***                  47.8                    2
Q10 Academic language: Understanding language demands                     .205***                  54.3                    2
Q11 Academic language: Supporting academic language development           .240***                  54.0                    2

*p < .10. **p < .05. ***p < .01.

Table 5. Correlations and Percentage Accurate Predictions

Category                         Correlation between      Percentage of
                                 predictions and scores   accurate predictions
Category P: Planning                    .219***                 17.5
Category I: Instruction                 .269***                 32.6
Category A: Assessment                  .113**                  30.3
Category R: Reflection                  .250***                 28.5
Category D: Academic language           .268***                 38.0

**p < .05. ***p < .01.


In 2 cases, the supervisors underpredicted the total score by almost half of the possible score of 44. More specifically, one underpredicted the total score by 19 points and the other by 21 points. A total of 22 candidates received scores that identified them as high performers, according to our cutoff point of 37 or above, yet in only 2 of those cases did the supervisors similarly predict this level of high performance.

Low-performing candidates. For the 21 low performers, correlations between predictions and scores ranged from –.311 to .718 for the 11 questions; only one correlation was statistically significant (Question 10: understanding language demands). Table 7 shows these results. Correlations for the five categories ranged from –.329 to .504, with one statistically significant correlation (Reflection). The frequency of accurate predictions for individual questions ranged from 5 accurate predictions (23.8%) for Question 10 (understanding language demands) to 10 accurate predictions (47.6%) for Question 5 (monitoring student learning during instruction). The frequency of accurate predictions for the categories ranged from 2 accurate predictions (9.5%) for Assessment to 6 accurate predictions (28.6%) for Instruction.

For most of the low-performing candidates (17/21 or 81%), the supervisors did not predict low performance according to our cutoff point of 20. Using the calibration standard, only 4 of the 21 cases would be considered accurate. In 13 of the 21 cases, the pass/fail designations differed. Of the 17 inaccurate predictions, 7 did not meet all three conditions of the standard, 5 did not meet two conditions, and 5 did not meet one condition. In 16 of these 17 cases, the university supervisors overpredicted their candidates' total score, and in 1 case, the supervisor underpredicted the total score. In the most extreme case, the supervisor overpredicted the low-performing candidate's total score by 19 out of a possible 44 points.

Discussion

Our findings indicate that in the majority of cases (63.5%), supervisors' predictions were within 5 points of the candidate's total score on the PACT teaching event. However, in some cases, predictions and scores were within 5 points but differed on other dimensions such as pass/fail designation. When we use the calibration standard as a measure of accuracy, 43.2% of the total 337 predictions are within the accurate range. The predictions that did not meet the calibration standard for accuracy were split between over- and underpredictions. For the total candidate group, supervisors did not have more accurate predictions on any individual questions or categories. No one question or category stood out as more or less accurate in prediction-score matching, and supervisors did not consistently over- or underpredict for a particular question or category. Approximately half of the supervisors accurately predicted performance on each individual question.

The most surprising differences occurred in prediction-score matching for high and low performers. Given the commonly held view that it is easiest to identify students on either end of the continuum, we anticipated more agreement for high and low performers. We thought that the supervisors, who observe and evaluate candidates in the classroom, would be in a prime position to predict which preservice teachers would perform particularly well or poorly on a teaching performance assessment. But supervisors did not provide closer predictions for those candidates.

Table 6. Correlations, Percentage Accurate Predictions, and Range for Predictions and Scores for High-Performing Candidates (n = 22)

Question or category                                               Correlation between      Frequency/percentage of   Range of difference:
                                                                   predictions and scores   accurate predictions      predictions and scores
Q1  Planning: Establishing a balanced instructional focus                 .279                   8/36.4                     2
Q2  Planning: Making content accessible                                  –.364*                  5/22.7                     2
Q3  Planning: Designing assessments                                       .368*                  7/31.8                     2
Q4  Instruction: Engaging students in learning                            .212                   9/40.9                     2
Q5  Instruction: Monitoring student learning during instruction          –.089                   6/27.3                     2
Q6  Assessment: Analyzing student work from an assessment                 .404*                  1/4.5                      2
Q7  Assessment: Using assessment to inform teaching                      –.162                   6/27.3                     2
Q8  Reflection: Monitoring student progress                               .058                   7/31.8                     2
Q9  Reflection: Reflecting on learning                                    .176                  11/50.0                     2
Q10 Academic language: Understanding language demands                     .240                   7/31.8                     2
Q11 Academic language: Supporting academic language development           .209                   8/36.4                     2
Category P: Planning                                                      .188                   3/13.6
Category I: Instruction                                                   .133                   2/9.1
Category A: Assessment                                                    .056                   1/4.5
Category R: Reflection                                                    .085                   6/27.3
Category D: Academic language                                             .276                   5/22.7

*p < .10.


In the high-performing group, most supervisors underpredicted their candidates' performance, and, in 2 cases, the difference was nearly half of the possible total score. The supervisors correctly predicted that candidates would pass the assessment, but in only 2 of the 22 cases did supervisors predict a total score that identified candidates as high performers in relationship to our cutoff point of 37. In the low-performing group, the majority of supervisors overpredicted their candidates' total scores, with one supervisor overpredicting by 19 points. In only 4 of the 21 cases did supervisors predict a total score that identified candidates as low performers in relationship to our cutoff point of 20. For the majority of the high and low performers, a group of 43 out of 337 candidates, their supervisors did not identify them as the exceptional candidates who would either excel or fail.

In addition to differences in total scores, the range of differences between predictions and scores on individual questions, particularly for high and low performers, was surprising. Differences of 1 point on a question are not striking. But differences of 2 and 3 points reflect highly contrasting perspectives about a candidate's skills and performance in a particular area. A 2-point range on a question is the difference between “not passing” and an “advanced level of performance” or between “adequate,” which is the lowest passing ranking, and an “outstanding performance.” A 3-point overprediction means that the supervisor predicted a score reserved for stellar candidates, yet the candidate received a failing score. It is difficult to comprehend that a supervisor would view a candidate as outstanding in an area in which the candidate fails on the performance assessment. Because the supervisors train and serve as scorers, they are familiar with the format, requirements, and standards of the performance assessment. The differences would not arise from a lack of knowledge about the performance assessment itself. The differences also do not appear to be because of tendencies of a particular supervisor or scorer. Among the high performers with the greatest differences between predictions and total scores, none had the same university supervisor or the same scorer for their teaching event. Similarly, none of the low performers with the greatest differences had the same supervisor or scorer.

We propose that the discrepancies between the predictions and scores in our study stem from three differences in the tasks of the supervisors and scorers. First, supervisors and scorers draw on different data sources. Whereas supervisors make predictions based on formative evaluations and classroom observations of candidates, scorers make judgments based on teaching artifacts and written commentaries. Supervisors in this program are not directly involved in preparing candidates for the performance assessment and do not review drafts of written commentaries. Their predictions stem from their observations of classroom teaching and discussions with candidates about their plans and instructional practice but not from candidates' written analyses of their teaching. Candidates who may be effective classroom teachers may not be as skilled in writing about their instructional practice. Moreover, some candidates may aim to achieve a high score on the performance assessment whereas others may make student teaching the priority. Second, supervisors observe and gauge progress over time, whereas scorers make a single judgment at 1 point in time. Supervisors focus on formative evaluations and feedback to help candidates improve whereas scorers make a summative assessment.

Table 7. Correlations, Percentage Accurate Predictions, and Range for Predictions and Scores for Low-Performing Candidates (n = 21)

Question or category                                               Correlation between      Frequency/percentage of   Range of difference:
                                                                   predictions and scores   accurate predictions      predictions and scores
Q1  Planning: Establishing a balanced instructional focus                –.238                   9/42.9                     3
Q2  Planning: Making content accessible                                  –.104                   7/33.3                     3
Q3  Planning: Designing assessments                                      –.274                   9/42.9                     3
Q4  Instruction: Engaging students in learning                            .229                   9/42.9                     2
Q5  Instruction: Monitoring student learning during instruction           .085                  10/47.6                     2
Q6  Assessment: Analyzing student work from an assessment                –.311                   6/28.6                     2
Q7  Assessment: Using assessment to inform teaching                       .171                   6/28.6                     2
Q8  Reflection: Monitoring student progress                               .067                   7/33.3                     2
Q9  Reflection: Reflecting on learning                                    .494                  11/52.4                     2
Q10 Academic language: Understanding language demands                     .718***                5/23.8                     1
Q11 Academic language: Supporting academic language development           .077                  11/52.4                     2
Category P: Planning                                                     –.329                   4/19
Category I: Instruction                                                   .314                   6/28.6
Category A: Assessment                                                   –.113                   2/9.5
Category R: Reflection                                                    .504**                 4/19
Category D: Academic language                                             .476                   4/19

**p < .05. ***p < .01.


Multiple scorers may consistently agree on the scores for a performance assessment, but supervisors may have differing perspectives from observing the ups and downs in a candidate's overall progression. Third, supervisors assess candidates' teaching in active classrooms with changing situations whereas scorers view a bounded, preselected segment of a class. What supervisors view in observations may not correspond with what scorers view in the performance assessment. For example, supervisors may observe ongoing classroom management problems that interfere with student learning whereas scorers see only minor issues in a video clip. For the performance assessment, candidates may choose among multiple teaching segments to submit, but candidates are unable to select what the supervisor observes during classroom visits. Consequently, supervisors tend to have more opportunity to observe how candidates respond to immediate situations that arise and how they adapt instruction to the particular context.

Conclusion and Implications

In this study, university supervisors' perspectives about their candidates did not always correspond with outcomes on the PACT teaching event, a summative performance assessment. Most of the candidates with the highest and lowest scores on the assessment were not those for whom the supervisors anticipated outstanding or poor performance. As described above, we posit that the primary reason that predictions did not match performance is not lack of knowledge about the assessment or the state's teaching performance expectations but rather differences in the roles of supervisors and scorers.

This study highlights issues that hold implications for practice, policy, and research on assessing preservice teachers' qualifications. The increasing emphasis on performance assessments is changing the role of the university supervisor in evaluating candidates' qualifications. Concerns about the validity and reliability of student teaching observations suggest that reliance on supervisors' evaluations of candidates for making summative judgments is problematic. Observations may be conducted too infrequently, training of supervisors may be insufficient to achieve interrater agreement, and observation forms may not be tailored to specific disciplines or levels (Arends, 2006b). Researchers report that summative judgments made from student teaching observation forms are unable to differentiate among various levels of effectiveness and that 95% of candidates receive a grade of “A” in student teaching (Arends, 2006b). However, in developing and implementing more discriminating forms of assessment, we may be eliminating, rather than lessening, the use of supervisors' observations in evaluations. Performance assessments and supervisor perspectives may provide different, yet equally valuable, information for overall assessments of candidates. In making classroom visits, supervisors gain a firsthand and in-depth view of the specific school context in which preservice candidates are teaching, and they observe student teachers' improvement and progress over time in this context. Supervisors, for example, may have more opportunities to observe how candidates interact with students from varying cultural and linguistic backgrounds and be in a stronger position to determine how candidates identify the language demands of learning tasks relative to students' current levels of academic language proficiency. Supervisors also may be better positioned to observe qualities related to the caring aspect of teacher–student relationships (Noddings, 1991) and the psychological and moral dimensions of teaching (Fenstermacher & Richardson, 2005). Ladson-Billings (2000) questions how a sense of caring and cultural solidarity can be exhibited in an assessment and what pieces of evidence would demonstrate the connection between a teacher and his or her students. During follow-up discussions with supervisors after observations, candidates can discuss specific challenges, receive prompt feedback from supervisors, and engage in immediate reflection on their teaching. During later observations, supervisors can gauge whether candidates make adaptations based on their reflections. The PACT assessment also incorporates contextual factors but with different forms of documentation. In the PACT teaching event, candidates describe their context and explain their selected instructional segment in relationship to the context. The written commentary offers evidence about how candidates analyze their own teaching after more in-depth reflection and how they use their analyses to plan for future instruction. But, due to the nature of the assessment, there is no mechanism for determining whether candidates actually implement the adaptations they propose. Given the financial and human resources required for both performance assessments and university supervisors, teacher education programs may be faced with difficult choices. Before we adopt practices or policies that either intentionally or unintentionally eliminate supervisors' perspectives about candidates' effectiveness, we need to carefully consider and closely examine the trade-offs.

Our findings also underscore the value of using multiple methods in assessing the qualifications and competence of preservice candidates. Performance assessments offer numerous benefits in comparison with traditional evaluations, but the credibility of performance assessments for licensing decisions is an ongoing concern (Arends, 2006b). In addition, unintended consequences are emerging as performance assessments are being mandated and implemented (Delandshere & Arens, 2001; Kornfeld et al., 2007; Rennert-Ariev, 2008). Multiple sources of information about a candidate stand to contribute to a more thorough assessment of effectiveness. The limitations of particular assessment strategies can be overcome by using broad-based assessment systems that include multiple sources of evidence from multiple evaluators. For a comprehensive assessment of candidates' progress, strategies that “appreciate the complexity of teaching and learning and that provide a variety of lenses on the process of learning to teach” are needed (Darling-Hammond, 2006, p. 120). Candidates may appear more, or less, effective during classroom observations than in their videotaped segment and accompanying commentary in a performance assessment. In addition, candidates with strong writing skills may have an advantage in analyzing and describing their teaching practice. In the face of limited time and varied personal responsibilities, some candidates may be inclined to devote more time and attention to their classroom teaching or alternately to the summative assessment.

Research on evaluation of practicing teachers similarly highlights the complexity of teaching and the need for analyzing and documenting effective teaching through a range of strategies. Peterson (1987, 2000) reports that multiple measures tap different aspects of teacher quality. In addition, multiple evaluators contribute a range of perspectives. “For some questions and situations, the perspectives of people in different roles are needed to recognize satisfactory or outstanding work” (Peterson, 2000, p. 5). If performance assessments become a single, high-stakes measure of preservice teacher qualifications and teacher education outcomes, multiple perspectives will be lost. In that case, supervisors' insights about a candidate's effectiveness would have little bearing on decisions about progress or possible remediation.

This study holds implications for future research about assessment of preservice candidates' qualifications. First, the lack of agreement about high and low performers in this study contradicts common wisdom about identifying stellar and struggling candidates and is an area that warrants further study. Why did these differences occur? What is the correlation between outstanding or failing scores on a performance assessment and other measures of candidates' abilities? Second, to make overall judgments about preservice teachers' abilities, we need to know more about the relationship between specific measures and particular aspects of teaching effectiveness in the student teaching phase. Some strategies may offer valuable evidence on one component but not another. In addition, strategies that may be informative in assessing practicing teachers may not be as relevant for evaluating preservice teachers. Continuing research on the contributions and limits of performance assessments and other strategies will be important in determining how to evaluate and foster candidates' professional growth during teacher preparation programs.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Arends, R. I. (2006a). Performance assessment in perspective: History, opportunities, and challenges. In S. Castle & B. S. Shaklee (Eds.), Assessing teacher performance: Performance-based assessment in teacher education (pp. 3-22). Lanham, MD: Rowman & Littlefield Education.

Arends, R. I. (2006b). Summative performance assessments. In S. Castle & B. S. Shaklee (Eds.), Assessing teacher performance: Performance-based assessment in teacher education (pp. 93-123). Lanham, MD: Rowman & Littlefield Education.

Brint, S. (1994). In an age of experts: The changing role of professionals in politics and public life. Princeton, NJ: Princeton University Press.

Bunch, G. C., Aguirre, J. M., & Tellez, K. (2009). Beyond the scores: Using candidate responses on high stakes performance assessment to inform teacher preparation for English learners. Issues in Teacher Education, 18(1), 103-128.

California Commission on Teacher Credentialing. (2006). Summary of commission responsibilities for major provisions of SB 1209. Retrieved from http://www.ctc.ca.gov/educator-prep/SB1209/default.html

Darling-Hammond, L. (1986). A proposal for evaluation in the teaching profession. Elementary School Journal, 86(4), 531-551.

Darling-Hammond, L. (2001). Standard setting in teaching: Changes in licensing, certification, and assessment. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 751-776). Washington, DC: American Educational Research Association.

Darling-Hammond, L. (2006). Assessing teacher education: The usefulness of multiple measures for assessing program outcomes. Journal of Teacher Education, 57(2), 120-138.

Darling-Hammond, L., & Snyder, J. (2000). Authentic assessment of teaching in context. Teaching and Teacher Education, 16(5-6), 523-545.

Darling-Hammond, L., & Sykes, G. (1999). Teaching as the learning profession: Handbook of policy and practice. San Francisco, CA: Jossey-Bass.

Delandshere, G., & Arens, S. A. (2001). Representations of teaching and standards-based reform: Are we closing the debate about teacher education? Teaching and Teacher Education, 17, 547-566.

Fenstermacher, G. D., & Richardson, V. (2005). On making determinations of quality in teaching. Teachers College Record, 107(1), 186-215.

Freidson, E. (2001). Professionalism, the third logic: On the practice of knowledge. Chicago, IL: University of Chicago Press.

Guaglianone, C. L., Payne, M., Kinsey, G. W., & Chiero, R. (2009). Teaching performance assessment: A comparative study of implementation and impact among California State University campuses. Issues in Teacher Education, 18(1), 129-148.

Kornfeld, J., Grady, K., Marker, P. M., & Ruddell, M. R. (2007). Caught in the current: A self-study of state-mandated compliance in a teacher education program. Teachers College Record, 109(2), 1902-1930.


Ladson-Billings, G. (1994). The dreamkeepers: Successful teachers of African American students. San Francisco, CA: Jossey-Bass.

Ladson-Billings, G. (1995). Toward a theory of culturally relevant pedagogy. American Educational Research Journal, 32(3), 465-491.

Ladson-Billings, G. (2000). The validity of National Board for Professional Teaching Standards (NBPTS)/Interstate New Teacher Assessment and Support Consortium (INTASC) assessments for effective urban teachers. Washington, DC: U.S. Department of Education.

Lee, C. (2007). Culture, literacy, and learning: Taking bloom in the midst of the whirlwind. New York, NY: Teachers College Press.

Leinhardt, G. (2001). Instructional explanations: A commonplace for teaching and location for contrast. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 333-357). Washington, DC: American Educational Research Association.

Mitchell, K. J., Robinson, D. Z., Plake, B. S., & Knowles, K. T. (2001). Testing teacher candidates: The role of licensure tests in improving teacher quality. Washington, DC: National Academy Press.

Munby, H., Russell, T., & Martin, A. K. (2001). Teachers' knowledge and how it develops. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 877-904). Washington, DC: American Educational Research Association.

National Board for Professional Teaching Standards. (1999). What teachers should know and be able to do. Arlington, VA: Author.

Noddings, N. (1991). The challenge to care in schools: An alternative approach to education. New York, NY: Teachers College Press.

Okhremtchouk, I., Seiki, S., Gilliland, B., Atch, C., Wallace, M., & Kato, A. (2009). Voices of pre-service teachers: Perspectives on the Performance Assessment for California Teachers (PACT). Issues in Teacher Education, 18(1), 39-62.

Pecheone, R. L., & Chung, R. R. (2006). Evidence in teacher education: The Performance Assessment for California Teachers (PACT). Journal of Teacher Education, 57(1), 22-36.

Pecheone, R. L., & Chung, R. R. (2007). PACT technical report. Stanford, CA: PACT Consortium.

Performance Assessment for California Teachers Consortium. (2009). Implementation handbook. Retrieved from http://www.pacttpa.org/_main/hub.php?pageName=Implementation_Handbook

Peterson, K. (1987). Teacher evaluation with multiple and variable lines of evidence. American Educational Research Journal, 24(2), 311-317.

Peterson, K. (2000). Teacher evaluation: A comprehensive guide to new directions and practices (2nd ed.). Thousand Oaks, CA: Corwin Press.

Porter, A., Youngs, P., & Odden, A. (2001). Advances in teacher assessments and their use. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 259-297). Washington, DC: American Educational Research Association.

Rennert-Ariev, P. (2008). The hidden curriculum of performance-based teacher education. Teachers College Record, 110(1), 105-138.

Richardson, V., & Placier, P. (2001). Teacher change. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 905-947). Washington, DC: American Educational Research Association.

Shulman, L. S. (1986). Paradigms and research programs in the study of teaching: A contemporary perspective. In M. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 3-36). New York, NY: Macmillan.

Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57, 1-22.

Snyder, J. (2009). Taking stock of performance assessments in teaching. Issues in Teacher Education, 18(1), 7-11.

Tellez, K. (1996). Authentic assessment. In J. Sikula (Ed.), Handbook of research on teacher education (2nd ed., pp. 704-721). New York, NY: Macmillan.

Uhlenbeck, A. M., Verloop, N., & Beijaard, D. (2002). Requirements for an assessment procedure for beginning teachers: Implications from recent theories on teaching and assessment. Teachers College Record, 104(2), 242-272.

Zeichner, K. M. (2003). The adequacies and inadequacies of three current strategies to recruit, prepare, and retain the best teachers for all students. Teachers College Record, 105(3), 490-519.

About the Authors
Judith Haymore Sandholtz is an associate professor in the Department of Education at the University of California, Irvine. Her research focuses on teacher professional development, teacher education, school–university partnerships, and technology in education.

Lauren M. Shea is a doctoral candidate in the Department of Education at the University of California, Irvine. Her research interests include professional development for teachers of English Language Learners and online blended methodologies in teacher education.