Journal of Personnel Evaluation in Education 4:189-201, 1990
© 1990 Kluwer Academic Publishers. Manufactured in the United States of America

Relationships Between Principals' Ratings of Teacher Performance and Student Achievement

RICHARD P. MANATT School Improvement Model, E005 Lagomarcino Hall, Iowa State University, Ames, IA 50011

BRUCE DANIELS Tupelo Public Schools, 201 S. Green Street, Tupelo, MS 38801

A principal's judgment must be based on observations, formal and informal, of teachers' and students' behaviors while teaching and learning is going on and comparisons between those behaviors and the principal's own conception or model of effective teacher behavior. Reasonable as this procedure seems, the research clearly indicates that it is not working. Why not? Is it because principals are not very good observers, because their conceptions or models of effective teacher behavior are erroneous, or because, although they possess these abilities, for some reason they cannot or do not use them? (Medley & Coker, 1987).

The lament of Medley and Coker raises key questions that ought to be driving the research on teacher performance evaluation for the next decade.

1. Can principals and other supervisors be trained to do better classroom observations?

2. Can they be helped by more accurate conceptions or models of effective teaching?

3. Can they be persuaded to devote the amount of time and effort to teacher performance evaluation necessary to put this knowledge to use?

4. Would principals do a better job of teacher performance evaluation if their own performance ratings depended on it?

Empirical tests of the accuracy of teacher ratings from 1921 through 1959 (a total of 11) all reached the same answer: the correlations between the average principal's ratings of teacher performance and direct measures of teacher effectiveness were near zero (Medley & Coker, 1987).

In a recent study involving 46 principals and 322 teachers, Medley and Coker (1987) provided no support whatever for the widely held belief that the average principal is a good judge of teacher performance. Nonetheless, their approach to the problem was a major contribution.

Total-systems performance evaluation

The total-systems approach to school improvement used by the School Improvement Model (SIM) afforded a longitudinal experiment that enabled the researchers to address each of the key issues of teacher performance evaluation: accurate conceptions of effective teaching, sufficient time and effort, appropriate training, and overcoming the conflict inherent in supervision (Manatt & Stow, 1986).

The teacher performance evaluation segment of the larger study required five years to complete. Because teaching is interactive, complex, and a very difficult human endeavor, it appeared likely that the previous attempts to link principals' ratings of teacher performance and student achievement were underfunded and, consequently, too short in duration to put all of the pieces into place. Moreover, previous attempts at this linkage did not have the benefit of the sound research model developed by Medley and Coker (Grant No. NIE-6-82-0029).

Accurate conceptions/models of teaching

When seeking a better understanding of effective teaching in the 1980s, researchers and practitioners quickly realize that (1) research on effective teaching is better than it was in the previous three decades and has much to offer for those seeking better concepts and models of teaching, and (2) unfortunately, effective teaching research applies mainly to structured content (e.g., math procedures or science facts) and does not necessarily apply to less structured content (e.g., the analysis of literature or historical trends) (Rosenshine & Stevens, 1986). Furthermore, especially for practitioners, the work of Madeline Hunter in developing the Teacher Decision-Making Model has helped clarify the principles of teaching.

Hunter's principles of teaching and the effective teaching research, however, are not one and the same. McGreal distinguishes between the two by defining the effective teaching research as a combination of correlational studies tying teacher behavior to student outcomes. The Hunter material, on the other hand, is theory (McGreal, 1983). Although Hunter uses the effective teaching research, she goes beyond what the research has identified, creating a teaching act, part of which is not necessarily tied to the research. The effective teaching research and the Hunter principles are compatible; however, knowledge of both is essential for preparing teacher performance evaluators (Tracy & MacNaughton, 1989).

For the present study, the four school districts involved used the Hunter model and the effective teaching research as sources of performance expectations. District personnel and the investigators from Iowa State University realized that the conceptualization of effective teaching was not complete but, pragmatically, agreed that it was better than anything else available.

Glickman's (1987) discussion of knowledge versus certainty also relates to this eclectic approach. In referring to the effective teaching research, he says, "We have knowledge but not certainty as to what improves instruction." Glickman views this knowledge base as a foundation that we can alter and build upon in the years ahead.

The process of developing and testing the teacher evaluation instruments for this investigation has been reported in detail elsewhere (Manatt, 1987). It suffices here to say that the planning committees, working in each of the school organizations, concluded that the research does support many of Hunter's concepts, that is, objectives, review, clear explanations with examples and nonexamples, short bits of information, regular checks for understanding, motivation, wait time, and closure.

Time and effort

The amount of time and effort devoted to teacher performance evaluation is generally considered to be very high. To provide more than cursory, bureaucratic performance evaluation (i.e., a one-hour look-diagnose-and-prescribe farce) would require much more time than is customary outside of career ladder states. Possibly, as Anderson (1989) contends, working one on one with teachers may be an obsolete supervisory pattern: "The amount of time and energy that goes into instructional supervision, compared with administratively-necessary classroom visits and related communications, is so close to zero that research about effective supervision is senseless."

When principals and other teacher performance evaluators have more skill, and when they attempt to improve teacher performance, not just rate it, teacher performance evaluation does take more time. The School Improvement Model (SIM) investigation, of which this study is a part, found that time per teacher increased from 6 hours to 12 hours once evaluators were sufficiently prepared with a clear understanding of their district's model of effective teaching and clinical supervision training.

Other obstacles

Since 1983 the purposes of teacher performance evaluation have broadened considerably. When teacher performance evaluation purposes are limited to "helping the teacher improve" and a cursory kind of accountability check, most teachers accept it graciously or at least tolerate it as "part of the job." The current wave of school reform has extended the purposes of teacher performance evaluation to include (in over 20 states) career ladder placement or pay-for-performance. Teachers are much more concerned with this type of evaluation, and they often disagree with their evaluations publicly. For example, 200 lawsuits were filed by Florida teachers in the first year of that state's Master Teacher Program (Hazi, 1989).

Most teachers, of course, won't sue--but they will get hostile and challenge their principal in regard to evaluative ratings and unsupported judgments. Blumberg (1980) has been outspoken in acknowledging the conflict and dissension endemic to supervision; he calls the relationships between supervisors and teachers "a private cold war!" More recently, Blumberg and Jones (1987) discussed the teacher's power to control the supervisor's view of the inner workings of classroom decision-making and asserted that there is a latent struggle for power and control in supervisory relationships.

The teachers' most common behavior in this struggle is to complain about performance criteria which are rated low on their summative evaluation reports, and the most common countermove on the part of those principals who do not have adequate supportive data is to simply raise the ratings. This results in a leniency bias and tends to inflate evaluations.

The evaluation reports used in the present study were given to the researchers without being shown to the teachers. These reports were for research purposes to validate the criteria used against student achievement. They were not "official evaluations" for personnel purposes.

Training

Principals and other supervisors seldom have more than one course in supervision during their preparation at the master's degree level. Unfortunately, that one course is often general supervision, more appropriate for a superintendent, rather than the clinical supervision needed by a soon-to-be first-line supervisor. Garman (1986) has long argued that to practice clinical supervision one needs prolonged training and a supervised internship. Even then, she cautions against sanctioning the practitioner of clinical supervision as an expert.

The researchers involved in the SIM took the position that teacher evaluators need a critical mass of training for the five steps of clinical supervision as proposed by Cogan (1973), Anderson (Goldhammer, Anderson, & Krajewski, 1980), and Goldhammer (1969) in their original works at the Harvard Graduate School of Education in the 1950s. Because the SIM effort intended to stress effective teaching research, SIM performance criteria and related training added emphasis to how teachers teach and the learning strategies they use as they relate to learning theory and effective teaching research. Since the locus of these performance criteria was external, the training provided would be called, in present-day language, neotraditional clinical supervision (Tracy & MacNaughton, 1989). Indeed, Madeline Hunter, Erline Minton, and members of the training cadre from the University Elementary School at UCLA were used to provide the initial training for all teacher evaluators and to prepare the SIM trainers who continued the work.

Neoprogressive clinical supervision takes a different approach. Proponents of the neoprogressive model frequently come from the scholarly community and from teachers' groups that have had negative experiences with neotraditionalists. The neoprogressives differ with the traditionalists on several points, some of which are use of research on teaching in setting the expectations, amount of teacher control of the agenda, use of the preconference, and role of assessment in the clinical process. Arthur L. Costa, Barbara Pavan, Noreen Garman, Nelson L. Haggerson, and Robert Slavin are among the researchers and theorists associated with the neoprogressive model of clinical supervision (Tracy & MacNaughton, 1989).

Neotraditional clinical supervision espouses a philosophy more akin to behaviorism, according to Hunter. The research base began with Thorndike, who showed that practice in itself, without knowledge of results of what was right or wrong and how to fix it, did not improve performance (Garman, Glickman, Hunter, & Haggerson, 1987). The present investigators argued that a direct connection runs between the criteria set up from effective teaching research and Hunter's theoretical approach. Consequently, the ten or more days of evaluation training provided to the principals of this study was Hunter-based, and several of Hunter's theoretical concepts were included on the performance evaluation instruments created by the four school districts.¹

Statement of the problem

Certain limitations in the methodology and instrumentation available to past research projects might account for the negative results reported (Medley & Coker, 1987).

Among those limitations are the following:

1. Contamination by interschool differences. To obtain a sample of teachers large enough to yield stable correlational estimates, previous investigators found it necessary to draw a sample of teachers from more than one school. Thus, ratings by different principals were intermingled in estimating a correlation. Differences among principals in observation skills, concepts of effective teaching, and ability to judge teacher performance could have distorted the correlational estimates and, therefore, masked any relationships that may have existed.

2. Content relevance of tests. For similar reasons, each previous investigator used the same achievement tests in all of the schools in the sample studied. Differences in objectives of different schools may have resulted in differences in the fit between the objectives measured by the test and objectives sought by different teachers and further distorted (and therefore underestimated) the correlations.


3. Violated assumptions. Statistical techniques used to isolate the teacher's contribution to student learning from that of other factors (especially student ability and previous achievement) involved assumptions not likely to be fulfilled, including the assumption that the correlation of student ability and previous achievement with end-of-year achievement is equal in different teachers' classes.

4. Regression artifact. Statistical procedures used to measure teacher effectiveness from student test scores in the past were based on the achievement gain of the average student in each teacher's class. Since classes differ widely in average ability, it was necessary to compare teachers on gains of students of different ability and then make statistical adjustments to compensate. It has been shown that, because of regression artifacts, these adjustments tend to exaggerate the differences instead of reducing them (Campbell & Erlebacher, 1971).
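
The regression artifact described in limitation 4 can be illustrated with a short simulation. The sketch below is not from the original study; the numbers (class means, error variances) are assumptions chosen only to show how matching two classes on an error-laden pretest makes identical true gains look different:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
gain = 5.0  # both classes truly gain 5 points

# Two classes that differ in mean ability but not in learning.
true_lo = rng.normal(40, 10, n)
true_hi = rng.normal(60, 10, n)

# Observed pretests and posttests contain measurement error.
pre_lo = true_lo + rng.normal(0, 8, n)
post_lo = true_lo + gain + rng.normal(0, 8, n)
pre_hi = true_hi + rng.normal(0, 8, n)
post_hi = true_hi + gain + rng.normal(0, 8, n)

# "Equate" the classes by matching students on an observed pretest band.
sel_lo = (pre_lo > 48) & (pre_lo < 52)
sel_hi = (pre_hi > 48) & (pre_hi < 52)

# Each matched group's posttest regresses toward its own class mean,
# so a spurious gap appears even though the true gains are identical.
spurious_gap = post_hi[sel_hi].mean() - post_lo[sel_lo].mean()
raw_gain_lo = post_lo.mean() - pre_lo.mean()  # close to the true gain of 5
```

The matched comparison favors the higher-ability class by several points despite equal learning, which is exactly the direction of bias Campbell and Erlebacher describe.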

By taking advantage of the uniform characteristics across all four school organizations in SIM and by using recent advances in statistical methodology, it was possible to design the present study in a manner to avoid these four limitations. In addition, the principals' judgments made to provide the final rating were not shown to the teachers involved and not used in a conference setting with the teachers. Of course, this was a violation of the philosophy of clinical supervision, but it was deemed a necessary control to avoid mixing the teachers' perceptions with the principals' judgments.

The study addresses three questions:

1. Can supervisors' ratings of teachers be used to predict student achievement?

2. Which teacher performance criteria selected from the Hunter model and research on effective teaching are related to higher achievement in fourth grade reading and mathematics and eighth grade mathematics?

3. Which logical groupings of teacher evaluation criteria, clustered into performance areas, are related to higher student achievement?

Procedures

Instrumentation

Each teacher in the study was evaluated at the end of the school year using the SIM Teacher Performance Evaluation Instrument, which measures teacher performance on 25 criteria. This instrument contained all of the common criteria chosen by the four public school districts in the SIM consortium during the planning years of 1980-1982. The instrument was created as a synthesis of the criteria chosen by the stakeholders' (planning) committees of each district after careful study of the research on effective teaching and the Hunter model.² Principals' ratings were recorded on this instrument by using a continuum of 1 (low performance) to 7 (high performance). In the language of the district's response mode:


1.0 - 2.9 = Needs Improvement
3.0 - 4.0 = Meets Standards
5.0 - 7.0 = Exceeds Standards

Teacher effectiveness in teaching reading and/or mathematics was estimated from standardized achievement tests (norm-referenced) and locally constructed, district-specific, criterion-referenced tests.

Design features

The following steps were taken to avoid the limitations of previous studies.

Contamination by interschool differences. All teacher evaluators were given common training to enhance observational skills and were thoroughly oriented to the same concepts of effective teaching. Evaluators received at least ten days of training to increase their ability to evaluate teacher performance.

Content relevance of tests. The tests used in each class of each school were administered as a part of the five-year SIM project. The norm-referenced tests were adopted by each district as a part of their regular testing program, while the criterion-referenced tests had been developed in the previous two years and reflected the curriculum alignment activities provided by SIM. This insured that the effectiveness of each teacher was measured in terms of success with the curriculum actually taught. It also meant that the test scores in different schools were not comparable. In order to compare the results within and across teachers' classes and schools, and to insure equal intervals on the measurement scale, the raw pretest and posttest class means (both norm-referenced and criterion-referenced) were converted to standard (Z) scores which had a mean of 50 and a standard deviation of 10.

Violated assumptions. To control for student ability and previous achievement, stepwise multiple regression was used. It has been well established that the pretest scores account for most of the variance in posttest scores, so the present study was concerned with relationships between the teacher performance ratings and the posttest scores after the effects of the pretest had been removed. Thus, each multiple regression test used the pretest scores as one independent variable and the individual performance criteria ratings, mean performance ratings, or mean cluster ratings as the second independent variable. The posttest score was, in each case, the dependent variable. The multiple regression tests determined whether the performance ratings contributed to the prediction of the posttest scores, given the importance of pretest scores.

Regression artifact. Since the primary analysis was on the dependent variable of posttest scores (not change scores) and because covariance or control was maintained by using pretest scores as the first controlling variable, the effect of the regression artifact has been minimized.
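
The standardization and pretest-first regression logic just described can be sketched in code. This is an illustrative reconstruction, not the study's actual analysis; the function names and the synthetic data in the usage notes are assumptions:

```python
import numpy as np

def to_standard_score(x, mean=50.0, sd=10.0):
    """Rescale raw class means to the scale used in the study
    (mean 50, standard deviation 10), so scores from different
    tests and schools fall on a comparable, equal-interval scale."""
    z = (x - x.mean()) / x.std()
    return mean + sd * z

def incremental_r2(pretest, rating, posttest):
    """Fit posttest ~ pretest, then posttest ~ pretest + rating,
    and return the additional share of posttest variance the
    rating explains once the pretest has been controlled."""
    def r2(cols, y):
        X = np.column_stack([np.ones(len(y))] + list(cols))
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return r2([pretest, rating], posttest) - r2([pretest], posttest)
```

The "additional percent of variation" figures in tables 1-3 correspond to the quantity returned by `incremental_r2`, computed with one criterion rating at a time alongside the pretest.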


Results

As in previous studies, principals' ratings of teacher performance did not predict student achievement on the norm-referenced tests. Such tests apparently are not sensitive enough to reflect differences in teaching performance. Fortunately for this investigation, custom-tailored, criterion-referenced pretests and posttests had been developed, pilot tested, and refined over a 24-month period. The content relevancy of the tests was assured in this manner.

The first question relating to the ability of supervisors' ratings of teachers to predict student achievement is answered in tables 1, 2, and 3. The hypothesis that the ratings of individual teacher performance criteria did not contribute to the prediction of posttest success on fourth grade mathematics and reading and eighth grade mathematics (after the effects of the pretest were removed) was tested using the forward stepwise multiple-regression procedure. As expected, the regression analysis revealed that the pretest was the best predictor of the posttest scores. However, each of the significant performance criteria, when used as a predictor, also accounted for additional percentages of the posttest variation. A t-value is given for the test of the coefficient of the independent variable (in this case, the performance criteria ratings) being 0. Note that the mix of significant criteria varied by both subject and grade level. In all, 21 of the 25 criteria were significant in at least one subject and grade level.
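
The t-test behind each table entry (testing whether the rating's coefficient is 0, with the pretest already in the model) is the standard OLS coefficient test. The sketch below is a generic implementation offered for illustration, not the authors' actual code; the function name is an assumption:

```python
import numpy as np

def rating_t_value(pretest, rating, posttest):
    """t statistic for H0: the rating's coefficient is 0 in the model
    posttest ~ intercept + pretest + rating; this is the test behind
    the t-values reported in tables 1-3."""
    n = len(posttest)
    X = np.column_stack([np.ones(n), pretest, rating])
    beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
    resid = posttest - X @ beta
    df = n - X.shape[1]                      # n minus fitted parameters
    sigma2 = (resid @ resid) / df            # residual variance estimate
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
    return beta[2] / se
```

With roughly 30 teachers per comparison, as in tables 1 and 2, the resulting statistic is compared against a t distribution with about n - 3 degrees of freedom.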

Which teacher performance criteria selected from the Hunter model and research on effective teaching relate to high achievement? Performance appraisal of fourth grade mathematics teachers resulted in the greatest number of significant criteria, 13 (table 1). The mean criteria rating was also significant. Fourth grade reading had a lesser number of significant criteria, 7. Only 4 were common to both subjects: "Communicates effectively," "Effective use of materials and resources," "Organizes students," and "Effective interpersonal relations" (table 2).

Eight performance criteria were significant predictors of achievement in eighth grade mathematics (table 3). These eight accounted for a lesser percentage of variance than those at the fourth grade level. Several were unique to eighth grade mathematics: "Promotes positive self-concept," "Demonstrates sensitivity," "Promotes self-discipline," "Involved with reaching goals," and "Desires feedback from students." Generally speaking, these are subsumed under the heading of "Positive Interpersonal Relations" and appear most desirable for teachers of early adolescents. The mean criteria rating was also significant.

It is of interest that only criterion 10, "Effective interpersonal relations," was common to all grades and subjects.

Finally, in order to answer question 3, the criteria were grouped according to the headings of Area 1: Productive Teaching Techniques; Area 2: Organized, Structured Class Management; Area 3: Positive Interpersonal Relations; and Area 4: Professional Responsibilities. Inspection of table 4 reveals that each of the areas was significantly related to student achievement in at least one subject. The rubric of Productive Teaching Techniques was significant for fourth grade mathematics only. Classroom Management was significant for fourth grade mathematics and reading. Positive Interpersonal Relations as a performance area was significant for mathematics at both the fourth and eighth grade levels. Professional Responsibilities as a cluster of criteria surprisingly proved to be significantly related to achievement in eighth grade mathematics. Thus, the answer to question 3 is that all four logical groupings of teacher performance criteria had some predictive value, varying by grade level and subject taught.

Of the 25 criteria rated, all but 4 were significant predictors of achievement for at least one subject and grade level (table 5). The four criteria which were not predictive of student achievement were "Provides for individual differences," "Modeling and concrete examples," "High expectations," and "Pleasant but not affectively extreme."

Table 1. Stepwise multiple-regression analysis of teacher performance ratings as predictors of student achievement: Fourth grade mathematics.

Teacher Performance Criteria                                    Additional % of Variation§   t-Value
 9. Organizes students for effective instruction.               11.4%                        3.04**
22. Demonstrates effective planning.                            10.0%                        2.80**
17. Monitors seatwork closely.                                   9.7%                        2.75*
16. Uses guided practice before independent practice.            9.3%                        2.68*
 2. Communicates effectively with students.                      8.9%                        2.61*
15. Demonstrates processes at beginning of learning (cueing).    8.7%                        2.57*
 8. Manages student behavior in a constructive manner.           8.3%                        2.50*
 1. Demonstrates ability to inspire and motivate students.       8.3%                        2.50*
 4. Prepares appropriate evaluation feedback.                    8.0%                        2.44*
25. Moves quickly through the curriculum.                        7.9%                        2.43*
24. Desires feedback from supervisors and principals.            7.3%                        2.32*
 6. Effectively uses available materials and resources.          6.9%                        2.23*
10. Demonstrates effective interpersonal relationships.          6.6%                        2.19*
Mean Criteria Rating                                             8.0%                        2.49*

* Significant at p < .05. ** Significant at p < .01.
§ Pretest accounts for 51 percent of posttest variation. n = 34 teachers.

Table 2. Stepwise multiple-regression analysis of teacher performance ratings as predictors of student achievement: Fourth grade reading.

Teacher Performance Criteria                                    Additional % of Variation§   t-Value
21. Displays a high energy level.                               11.8%                        3.06**
 6. Effectively uses available materials and resources.         11.5%                        3.00**
 7. Demonstrates effective planning and organization.           10.3%                        2.81**
 3. Uses variety of evaluation methods with specific feedback.   6.9%                        2.21*
 9. Organizes students for effective instruction.                6.8%                        2.18*
10. Demonstrates effective interpersonal relationships.          6.4%                        2.11*
 2. Communicates effectively with students.                      6.2%                        2.07*

* Significant at p < .05. ** Significant at p < .01.
§ Pretest accounts for 49 percent of posttest variation. n = 34 teachers.

Table 3. Stepwise multiple-regression analysis of teacher performance ratings as predictors of student achievement: Eighth grade mathematics.

Teacher Performance Criteria                                    Additional % of Variation§   t-Value
12. Demonstrates sensitivity in relating to students.            4.5%                        3.61**
13. Promotes self-discipline and responsibility.                 3.2%                        2.73*
22. Demonstrates effective planning.                             3.2%                        2.73*
23. Desires feedback from students.                              3.1%                        2.69*
14. Involved with reaching district and building goals.          3.0%                        2.58*
 1. Demonstrates ability to inspire and motivate students.       2.5%                        2.31*
10. Demonstrates effective interpersonal relationships.          2.4%                        2.24*
11. Promotes positive self-concept.                              2.3%                        2.19*
Mean Criteria Rating                                             3.0%                        2.34*

* Significant at p < .05. ** Significant at p < .01.
§ Pretest accounts for 90 percent of posttest variation. n = 19 teachers.

Table 4. Stepwise multiple-regression analysis of teacher performance ratings clustered in logical performance areas as predictors of student achievement.

Logical Performance Area                           4th Math (CRT)        4th Reading (CRT)     8th Math (CRT)
                                                   % of Var.§  t-Value   % of Var.§§  t-Value  % of Var.§§§  t-Value
Area 1 (Productive Teaching Techniques)            8.7%        2.57*     not significant       not significant
Area 2 (Organized, Structured Class Management)    9.7%        2.75**    8.0%         2.40*    not significant
Area 3 (Positive Interpersonal Relations)          7.9%        2.04*     not significant       3.4%          2.87*
Area 4 (Professional Responsibilities)             not significant       not significant       2.8%          2.77*

* Significant at p < .05. ** Significant at p < .01.
§ Pretest accounts for 51 percent of posttest variation. §§ Pretest accounts for 49 percent of posttest variation. §§§ Pretest accounts for 90 percent of posttest variation.

Table 5. Predictive ability of all 25 criteria by subject and grade level.

Criterion                                      4th Grade Mathematics   4th Grade Reading   8th Grade Mathematics
 1. Motivation of students                     X                                           X
 2. Effective communication with students      X                       X
 3. Variety in evaluation with feedback                                X
 4. Appropriate evaluation feedback            X
 5. Provision for individual differences
 6. Material and resource use                  X                       X
 7. Planning and organization                                          X
 8. Management of behavior                     X
 9. Organization for instruction               X                       X
10. Interpersonal relationships                X                       X                   X
11. Positive self-concept                                                                  X
12. Sensitivity toward students                                                            X
13. Self-discipline and responsibility                                                     X
14. District and building goals                                                            X
15. Use of cueing                              X
16. Use of practice                            X
17. Monitoring of seatwork                     X
18. Modeling and concrete examples
19. High expectations
20. Pleasant but not extreme
21. High energy level                                                  X
22. Effective planning                         X                                           X
23. Student feedback                                                                       X
24. Supervisor feedback                        X
25. Coverage of curriculum                     X
Mean Criteria Rating                           X                                           X

Discussion

The paramount finding of this study is that principals can accurately evaluate the performance of teachers. When principals were given extensive training and when the limitations of earlier studies regarding instrumentation and methodology were overcome, they were good judges of teacher performance. The majority of the criteria selected for the performance evaluation instrument worked, albeit with differing results for different subjects and grade levels.

Why did this combination of concepts, training, and performance evaluation criteria succeed while so many previous investigations have not? First, enough training was provided. Hunter has repeatedly noted that the students of essential elements of effective instruction progress through three levels: propositional knowledge, performance knowledge, and contingent knowledge. Students must begin by understanding the if-then propositions of the principles of learning. Next, they must demonstrate the effective techniques, and, finally, they can determine if a particular teaching technique is needed. In our training we repeatedly noted this progression. Teacher evaluators, with only a few days of training, would rate a teacher at the top level for merely exhibiting a particular behavior. Later, with more training and feedback, they could assess the quality of a teaching behavior.

A second and very important reason for success in this investigation was that curricular-aligned testing was used instead of simply relying on norm-referenced tests. The criterion-referenced pretests and posttests were much more sensitive to changes in teacher behaviors.

When examining the individual criteria and their ability to predict student achievement, it was remarkable that "modeling" and "individualized instruction" had no predictive ability. When the ratings were examined, most of the teachers had high ratings for modeling, perhaps a reflection of the five years of training for both teachers and administrators. Individualized instruction was simply a neutral criterion: it didn't hurt and it didn't help the prediction of achievement.

In the final analysis, teacher performance evaluation must be valid, reliable, legally discriminating, and economical. The principals in this study were able to rate accurately the effective and less effective teachers. Because the overarching study, the School Improvement Model, included a benefit/cost component, we were also able to provide cost estimates for the evaluation of each teacher (after training costs were removed). Average costs per teacher ranged from $116 in the largest district to $108 in the smallest district (Darnell, 1984, p. 48). This seems a reasonable price to pay for accurate assessment of teacher performance.

Acknowledgments

We gratefully acknowledge support for this study from the Northwest Area Foundation of St. Paul, Minnesota. The conclusions reported and the opinions expressed are those of the authors and do not necessarily reflect the views of the foundation. Address correspondence to Richard P. Manatt, Director, School Improvement Model, E005 Lagomarcino Hall, Iowa State University, Ames, IA 50011.

Notes

1. The training provided included one element of neoprogressive clinical supervision: the preobservation conference. This element was requested by all four of the districts' planning committees.
2. For a detailed presentation of the common criteria, see Manatt and Stow, 1984.

References

Anderson, Robert H. (1989). Unanswered questions about the effect of supervision on teacher behavior. Journal of Curriculum and Supervision, 4(4), 291-297.

Blumberg, Arthur. (1980). Supervisors and teachers: A private cold war. Berkeley, CA: McCutchan.

Blumberg, Arthur, & Jones, R. Steven. (1987, May). The teacher's control over supervision. Educational Leadership, 44, 58-63.

Campbell, D.T., & Erlebacher, A. (1971). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In G. Hellmuth (Ed.), The disadvantaged child, Vol. 3. New York: Mazel.

Cogan, Morris L. (1973). Clinical supervision. Boston: Houghton Mifflin.

Darnell, David F. (1984). An analysis of the costs of administration in teacher evaluation. Unpublished doctoral dissertation, Iowa State University, Ames.

Garman, Noreen B. (1986, Winter). Clinical supervision: Quackery or remedy for professional development. Journal of Curriculum and Supervision, 1, 148-157.

Garman, Noreen B., Glickman, Carl D., Hunter, Madeline, & Haggerson, Nelson L. (1987, Winter). Conflicting conceptions of clinical supervision and the enhancement of professional growth with renewal: Point and counterpoint. Journal of Curriculum and Supervision, 2, 157.

Glickman, Carl D. (1987). Supervision for instructional improvement. Tape recording of a presentation to the 1987 annual meeting of ASCD. Alexandria, VA: Association for Supervision and Curriculum Development.

Goldhammer, Robert. (1969). Clinical supervision: Special methods for the supervision of teachers. New York: Holt, Rinehart and Winston.

Goldhammer, Robert, Anderson, Robert H., & Krajewski, Robert J. (1980). Clinical supervision: Methods for the supervision of teachers. New York: Holt, Rinehart and Winston.

Hazi, Helen M. (1989). Measurement versus supervisory judgment: The case of Sweeney vs. Turlington. Journal of Curriculum and Supervision, 4(2), 211-229.

Manatt, Richard P. (1987). Lessons from a comprehensive performance appraisal project. Educational Leadership, 44(7), 8-14.

Manatt, Richard P., & Stow, Shirley B. (1984). Clinical manual for teacher performance evaluation. Ames: Iowa State University Research Foundation.

Manatt, Richard P., & Stow, Shirley B. (1986, February). Developing and testing a model for measuring and improving educational outcomes of K-12 schools. Technical report, School Improvement Model (SIM). Ames: Iowa State University.

McGreal, Thomas L. (1983). Successful teacher evaluation. Alexandria, VA: Association for Supervision and Curriculum Development.

Medley, Donald M., & Coker, Homer. (1987). The accuracy of principals' judgments of teacher performance. Journal of Educational Research, 80(4), 242-247.

Rosenshine, Barak V., & Stevens, R. (1986). Teaching functions. In Handbook of research on teaching. New York: Macmillan, pp. 376-391.

Tracy, Saundra J., & MacNaughton, Robert H. (1989). Clinical supervision and the emerging conflict between the neo-traditionalists and the neo-progressives. Journal of Curriculum and Supervision, 4(3), 246-256.