
LONGITUDINAL ASSESSMENT OF CRITICAL THINKING IN COLLEGE
WHAT MEASURES ASSESS CURRICULAR IMPACT?

Marcia Mentkowski
Glen Rogers

Office of Research & Evaluation
Alverno College

Findings reported here are based upon a study funded by a grant from the National Institute of Education (NIE-G-77-0058), as one part of a larger research project. Tamar Ben-Ur assisted with the statistical analyses reported in this paper.

Paper presented at the annual meeting of The Mid-Western Educational Research Association in Chicago, October 1985.

Longitudinal Assessment of Critical Thinking in College
What Measures Assess Curricular Impact?

Marcia Mentkowski
Glen Rogers

Educational Research and Evaluation
ALVERNO COLLEGE
Milwaukee, WI

This publication is available from:

Alverno College Institute
3400 South 43rd Street
PO Box 343922
Milwaukee, WI 53234-3922
Phone: 414-382-6000
www.alverno.edu

Graphic Design: Lynn Chabot-Long, Project Specialist, Educational Research and Evaluation

Copyright 1985. Alverno College Institute, Milwaukee, Wisconsin. All rights reserved under U.S., International and Universal Copyright Conventions. Reproduction in part or whole by any method is prohibited by law.


ABSTRACT

Longitudinal results from a college outcomes study were used to judge one widely used and three newer measures of critical thinking. Each measure was reviewed for how well it measured longitudinal versus cross-sectional change, for whether it was associated with progress in the curriculum, for its association with background variables (e.g., high school GPA), and for whether it assessed change in critical thinking for both traditional-age and older adult students. Two of the newer measures are better college outcomes measures on these criteria. Interrelationships among the measures suggest that critical thinking is made up of multiple components, so researchers and educators are advised to improve both production and recognition measures.

INTRODUCTION

Objectives

This study explores (1) the validity of a set of four critical thinking measures, (2) the possibility of an

expanded domain of critical thinking abilities, (3) the degree and pattern of change in critical thinking

abilities in an undergraduate population, and (4) the relationship between progress in the curriculum,

which is outcome-centered and performance-based, and change in critical thinking abilities over the

course of four years.

Perspective and Theoretical Background

Educators and researchers alike are strengthening their commitment to critical thinking as a college

outcome. The Association of American Colleges, in a recent report redefining the baccalaureate, has

identified critical thinking as one of the outcomes all colleges should prepare students to demonstrate

(Integrity in the College Curriculum, February 1985). In 1984, major conferences at Wingspread, Harvard,

Sonoma State and The University of Chicago addressed critical thinking assessment. Accrediting agencies

(e.g., North Central, COPA) are also calling for assessment of broad college outcomes. Recently,

attention has turned to both the assessment of individual student learning and to the institutional

evaluation of student outcomes (Ewell, 1984; Marchese, 1985; Mentkowski & Doherty, 1984).

The need for measures that assess broad college outcomes and that are not limited to amount of

knowledge is critical. But how are administrators and faculty to assess such outcomes without instruments that contribute to cross-college measurement and that have face validity for liberal arts faculty? Many

liberal arts faculty do not just focus on knowledge measured by recognition tasks like SATs or GREs, but

instead focus on higher order cognitive processes that are measured by production tasks and that have

been conceptualized as central to newer definitions of critical thinking (Lipman, 1984; Nickerson, 1984;

Paul, 1984; Sigel, 1984; Sternberg, 1983; Winter & McClelland, 1978). Moreover, there is a reciprocal

relation between assessment and teaching. For example, Frederiksen (1984) observes in the American

Psychologist that use of the newer production measures could also encourage teaching of higher level

cognitive skills and provide practice with feedback. Several researchers have designed newer measures,

but how well do they work as college outcomes measures?

METHOD

Critical Thinking Measures

A total of 12 cognitive-developmental, learning style and generic ability measures were administered in

a longitudinal study of college outcomes (Mentkowski & Doherty, 1983). Four of these are the focus of

this analysis of critical thinking measures. Of these, two involve, at least partially, the more typical

recognition tasks whereby the participant must select the correct answer:


(1) Watson and Glaser’s Critical Thinking Appraisal (Watson & Glaser, 1964) is a standardized, easily

used recognition measure of component critical thinking abilities. The Form ZM subscales

assessing quality of inferences, recognition of assumptions, and deductive reasoning were

administered in this study. These subscales are henceforth designated as the Inference,

Recognition of Assumptions, and Deduction subscales, respectively. The interpretation subscale

and the evaluation of argument subscale were not administered.

(2) Test of Cognitive Development (Renner, Fuller, Lockhead, Tomlinson-Keasey & Campbell, 1976)

is a newer series of paper and pencil tasks designed to measure formal operational thinking as

defined by Piaget. Although participant responses for several tasks are scored from multiple

choice answers, written justification of the answers often is considered in scoring as well.

Scoring of the flexibility of rods task focuses upon participants’ written explanation of their

reasoning. The theoretical justification for the test suggests sophisticated cognitive processes

are being measured.

The other two critical thinking measures administered are pure production measures:

(3) Analysis of Argument (Stewart, 1977a, 1977b) is designed to measure flexibility in arguing (with

consistency) for opposite positions on a specified issue. In response to the stimulus, an

emotional one-sided essay, participants write two essays. The “Attack” essay is scored for

whether it has a central organizing principle and whether it focuses on the faulty logic of the

stimulus essay. The “Defense” essay is scored for whether it reflects a modified or qualified

endorsement of the counter-attitudinal stimulus essay.

(4) Test of Thematic Analysis (Winter, 1976; Winter & McClelland, 1978) is designed to measure the

ability to form complex concepts and communicate them. The task requires participants to

compare and contrast two sets of essays according to the themes in the two essay sets. Set A

includes three stimulus essays, each about four sentences long, as does Set B. Appendix A gives the

brief titles of the nine scoring criteria.

Study Design and Inventory Administration

All undergraduates who entered a women’s college in 1976 and 1977 were recruited as volunteer

participants in the longitudinal study. In 1977, a Weekend College timeframe was offered for the first

time. Thus, the 1977 cohort includes students from both the weekend and weekday timeframes. The

longitudinal cohorts were assessed at entrance, two years after entrance, and three and one-half years

after entrance.

As a cross-sectional comparison group, the entire 1978 weekday graduating class was recruited for

assessment at graduation. In order to control for attrition effects, the entrance scores of the students in

the 1977 weekday longitudinal cohort who did not graduate were deleted when they were cross-sectionally compared to the scores of the 1978 graduates. Measures were administered in large group

sessions.

Data Source, Attrition, and Participation

Alverno College is a women’s college. At the time of the study, the students were predominantly

Caucasian and from one midwest state. Many were first-generation college students. The college

traditionally has served working class students from a large urban area, and has not been highly

selective.

At the first assessment, all of the women who entered in 1976 and 1977 were recruited as participants.

To be considered eligible for recruitment for later longitudinal assessments, the students had to be

currently enrolled and to have completed the prior assessments on at least a subset of the inventories.

The weekend college attracted older students in increased numbers. Longitudinal data analyzed for this

study include all students who participated on all three occasions for a particular instrument. Between

83% and 99% of the students participated at each assessment for at least a subset of the inventories. The

longitudinal data pool, n = 208, included both traditional-age (17–19 years; n = 108) and older students

(20–55 years; n = 100). The graduating class used as a cross-sectional comparison group also included

both traditional age (n = 45) and older students (n = 15).

Since not all participants completed all of the inventories, the number of observations per inventory

varies. The proportion of older and younger participants remains about the same. The lowest number of

replicated observations across three assessments occurred for the Analysis of Argument “Attack” essay,

n = 133, and “Defense” essay, n = 130. This is accounted for by a delayed decision to include the

inventory in the study, which resulted in a reduced number of observations at time 1 for this inventory.

For the other inventories reported here, between 181 and 194 participants completed all three

assessments for the particular inventory. The vast majority of the participants completed all of the

instruments. For the purposes of obtaining internal reliability estimates only, the Analysis of Argument

scores were used for all those who completed the Analysis of Argument inventory at the time of

assessment, even if they did not complete all three assessments. Thus, between 183 and 189

participants were available for these time 2 and time 3 reliability analyses.

Main Analyses Employed

Raw change over occasions of assessment was studied with multivariate analysis of variance for

repeated measures and unequal n’s in an Age X Time factorial design, with the repeated measures on

the Time of assessment. In this analysis, both the linear and the quadratic effects of Time were tested

with orthogonal polynomial contrasts. Weights were used to adjust for the unequal lengths of Time

between the intervals. The association of the critical thinking measures with rate of progress in the

curriculum was measured with correlational analyses. Alpha coefficients were computed to determine

the internal reliability of measures at each time of assessment. The amount of variance accounted for by

scores from the previous administration of each measure was computed to determine test-retest


predictability. Mentkowski and Strait (1983) have previously reported data analyses of this data set,

which are sometimes referenced here. The present analyses extend the previous analyses by exploring

the Age X Time interaction directly, by focusing entirely upon unadjusted scores, by exploring the

internal reliability of the summated measures, and by analyzing the component criteria for the Test of

Thematic Analysis and Analysis of Argument measures.
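To make the contrast logic concrete, the sketch below shows, with hypothetical data rather than the study data, one way to build linear and quadratic orthogonal polynomial contrasts weighted for the unequal assessment intervals (entrance, two years, and three and one-half years after entrance) and to test them across the two age groups. It is a minimal illustration of the technique, not the analysis software used for the study; the variable names and the simple t-test formulation are our own assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical scores: rows = participants, columns = the three assessments
rng = np.random.default_rng(0)
scores = rng.normal(loc=[11.5, 12.2, 12.4], scale=2.0, size=(200, 3))
age_group = np.repeat(["17-19", "20-55"], 100)        # between-subjects factor

# Orthogonal polynomial contrast weights for the unequally spaced occasions
# (0, 2, and 3.5 years after entrance), obtained by orthogonalizing 1, t, t^2.
t = np.array([0.0, 2.0, 3.5])
basis = np.vander(t, 3, increasing=True)              # columns: 1, t, t^2
q, _ = np.linalg.qr(basis)
linear, quadratic = q[:, 1], q[:, 2]                  # each sums to (about) zero

# Per-participant linear and quadratic contrast scores
lin = scores @ linear
quad = scores @ quadratic

# Main effects of Time: is the mean contrast score different from zero?
print("Linear Time:   ", stats.ttest_1samp(lin, 0.0))
print("Quadratic Time:", stats.ttest_1samp(quad, 0.0))

# Age x Time interaction: do contrast scores differ between the two age groups?
young, older = lin[age_group == "17-19"], lin[age_group == "20-55"]
print("Age x linear Time:", stats.ttest_ind(young, older, equal_var=False))
```

Testing per-participant contrast scores captures the spirit of the repeated-measures contrasts described above; a full multivariate analysis would also combine the linear and quadratic contrasts in a single test.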

RESULTS AND CONCLUSIONS

Reliability

Analysis of Argument post-test scores showed no predictability from Analysis of Argument pre-test

scores. The inter-item reliability analysis of the “Attack” essay’s five scoring criteria (one was dropped

because of multicollinearity) yielded unacceptably low reliability coefficients at times two and three (see

Table 1), even though at least 184 students participated at these administrations. Although we believe

researchers need to accept lower reliability estimates for production measures, the coefficients for the

“Attack” essay’s scoring criteria at time two and time three do not inspire confidence in the current

measurement. Since there appeared to be sufficient variability in the scores for each scoring criterion,

we surmise that our relatively low internal reliability coefficients might be improved by increasing the

number of scoring criteria. Alternatively, a consistently unitary construct may not yet underlie the six

scoring criteria for the “Attack” essay. The internal reliability of the “Defense” essay’s four scoring

criteria was relatively high at each time of assessment (see Table 1).
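For reference, the alpha coefficients reported in Table 1 can be computed with the standard Cronbach's alpha formula; the scoring-criterion codes below are hypothetical stand-ins for the dichotomous criterion scores, not the study data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-criteria matrix of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each scoring criterion
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summated scale
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 0/1 codes for five "Attack" essay scoring criteria, 184 respondents
rng = np.random.default_rng(1)
attack_items = rng.integers(0, 2, size=(184, 5)).astype(float)
print(f"alpha = {cronbach_alpha(attack_items):.2f}")
```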

The Test of Cognitive Development showed good internal reliability for the five item scale (see Table 1).

In addition, pretest scores for the Test of Cognitive Development also accounted for a high proportion of

the variance in the Test of Cognitive Development post-test scores. Averaging across the percent of

variance accounted for by time one in predicting time two and by time two in predicting time three,

29.8% of the variance (R Square) was statistically predictable.
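The "variance accounted for" figures used here and below are simply squared pre/post Pearson correlations averaged over the two assessment intervals; a minimal sketch with hypothetical scores follows.

```python
import numpy as np

def variance_accounted_for(pre: np.ndarray, post: np.ndarray) -> float:
    """Squared Pearson correlation (R squared) between pre- and post-test scores."""
    return np.corrcoef(pre, post)[0, 1] ** 2

rng = np.random.default_rng(2)
time1 = rng.normal(11.5, 2.0, 190)
time2 = 0.5 * time1 + rng.normal(6.0, 1.7, 190)   # hypothetical, partially stable scores
time3 = 0.5 * time2 + rng.normal(6.2, 1.7, 190)

mean_r2 = (variance_accounted_for(time1, time2) + variance_accounted_for(time2, time3)) / 2
print(f"average variance accounted for: {100 * mean_r2:.1f}%")
```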

The Test of Thematic Analysis appeared less internally reliable. The low internal reliability may be

partially due to the lack of variance in some of the nine scoring criteria, which were dichotomously

scored. The reliability coefficients for the scale were improved by removing just one scoring criterion,

“making exceptions or qualifications.” Another scoring criterion was removed because of its poor

correlation with the remaining seven criteria and its limited variance. This negatively scored criterion,

“Affect,” is scored if the participant makes a comparison that is based upon her emotional reaction to

the story. In order to prevent inappropriate extreme scores, two scoring criteria that yielded scores with

an extremely restricted variance were also removed, leaving 5 of 9 of the Test of Thematic Analysis

scoring criteria for a summated scale. These procedures yielded somewhat improved internal reliability

coefficients (see Table 1). Appendix A documents the scoring criteria included and excluded from this

exploratory 5 of 9 summated criteria-measure for the Test of Thematic Analysis.
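The item-pruning steps described in this paragraph can be approximated by recomputing alpha with each scoring criterion removed in turn ("alpha if item deleted"); the sketch below assumes hypothetical dichotomous codes for the nine criteria and is not the exact procedure used in the study.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1.0 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(3)
tta_items = rng.integers(0, 2, size=(194, 9)).astype(float)   # hypothetical 0/1 criterion codes

print(f"all nine criteria: alpha = {cronbach_alpha(tta_items):.2f}")
for j in range(tta_items.shape[1]):                           # "alpha if criterion deleted"
    reduced = np.delete(tta_items, j, axis=1)
    print(f"without criterion {j + 1}: alpha = {cronbach_alpha(reduced):.2f}")
```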

Although internal reliability coefficients were not computed for the Critical Thinking Appraisal subscales,

the proportion of variance accounted for by pre-test scores for each interval confirms the test-retest


reliability of these subscales. Averaged across the two intervals, pretest scores account for 26.3% of the

Inference post-test scores, for 19.1% of the Recognition of Assumptions post-test scores, and for 31.9% of the Deduction post-test scores. Although the relatively high predictability of the post-test scores from

the pretest scores supports the reliability of this set of Critical Thinking Appraisal measures, the lower

predictability of the post-test scores for the Test of Thematic Analysis may only be due to individual

differences in response to educational or other influences.

A Face Validity Issue

A qualitative examination of the “Defense” essays written by the students suggested that the students

were role playing the emotional and faulty arguments they were asked to defend. This face validity

concern is consistent with the scoring of the “Defense” essays. At each time of assessment, the scoring

of the “Defense” essays according to the four scoring-criteria yielded very little variance among

students. In the case of the first assessment, for example, 93% “totally” endorsed the presented argument (which was probably counter to their attitude); 91% presented new arguments supporting the counter-attitudinal statement; 4% made a modified endorsement of it; and 4% accepted a particular part of it. In other words, in the “Defense” essay, almost all of the entering freshmen did exactly the opposite of

what would yield a high score in the standard scoring. Our qualitative observation that the students

were role playing the bad arguments they were asked to defend is also consistent with the lack of

instructions discouraging role playing in the Analysis of Argument inventory. We also observe that

our students encounter numerous role playing exercises in their education.

Trends through Time

Our concerns about the face validity of the “Defense” essays and about the internal reliability of the

“Attack” essay scoring criteria are supported by the previously conducted analyses on this data set (see

Mentkowski and Strait, 1983). We know from these analyses that neither the “Attack” essay’s

summated measure nor the “Defense” essay’s summated measure yielded any statistically significant

differences across the times of assessment. Mentkowski and Strait (1983) also found that these

summated measures were not related to measures of progress in the college’s curriculum. The lack of

internal reliability for the “Attack” essay’s scoring criteria suggested an exploratory analysis with the

individual criteria as the unit of analysis. None of these individual scoring criteria yielded statistically significant differences for Time of assessment, however.

Given the exploratory nature of this research, the Test of Thematic Analysis was analyzed in more than

one way. Not only was a total score for the Test of Thematic Analysis included (in this case, the 5 of 9

criteria summated measure), but also, three of the scoring criteria derived from this measure were

separately analyzed. Thus, analysis of variance was also performed on (1) the criterion of making

exceptions or qualifications to one’s ascriptions, (2) the criterion of giving examples for observations,

and (3) the negatively scored criterion of affective reaction. It should be noted that the examples

criterion is a component of the Test of Thematic Analysis summated measure (5 of 9 criteria). As is

noted below, analysis of this summated measure and analysis of its component scoring criterion,


examples, yield very similar results. This is probably largely because the scores yielded by the examples

scoring criterion actually contribute the bulk of the variance to the summated score. Thus, like the

summated scale, the examples scoring criterion does not correlate positively with the exceptions

criterion at any of the times of measurement. Again, the direction of association is always negative,

even though, in this case, it is not statistically significant.

The Test of Thematic Analysis scoring criterion for affect-based comparisons was scored for only 16% of

the essay comparisons. Because of low variance, a Friedman’s distribution-free one-way analysis of

variance procedure for time of assessment was again conducted. It showed no statistically significant

difference, Chi Square (2, N = 194) < 1.
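A distribution-free check of this kind can be run directly with scipy's Friedman test; the 0/1 affect codes below are hypothetical, generated only to mimic the roughly 16% scoring rate.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(4)
affect = rng.binomial(1, 0.16, size=(194, 3))   # hypothetical codes for the three assessments

stat, p = friedmanchisquare(affect[:, 0], affect[:, 1], affect[:, 2])
print(f"Friedman chi-square(2) = {stat:.2f}, p = {p:.3f}")
```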

The Age X Time repeated measures analysis of variance procedure for unequal n’s and unequal intervals,

which tested for the linear and quadratic effects of Time, was performed on the following seven

measurements: the Test of Cognitive Development summated measure, the Test of Thematic Analysis

summated measure (5 of 9 criteria), the examples and exceptions scoring criteria for the Test of

Thematic Analysis, and the three subscales of the Critical Thinking Appraisal. This subset of seven

measurements, which includes both summated scales and single item scores, yielded many effects. The

Age X Time analyses yielded main effects for Age on 4 of 7 of these measurements (see Table 2). They

also yielded main effects for linear Time on all seven of these measurements. A main effect for quadratic

Time was revealed on the Test of Thematic Analysis (5 of 9 criteria) and the Recognition of Assumptions subscale of the Critical Thinking Appraisal (see Table 2). The lack of any interaction effects between Age and Time (all F values less than 1) suggests that these main effects of linear and quadratic Time

are relatively general for both age groups.

First, looking at those measures not associated with Age effects, we see that scores on the Test of

Cognitive Development and the scores on the Deduction subscale of the Critical Thinking Appraisal

improve linearly with Time (see Table 3). Recall that the analysis of variance procedure (see Table 2)

adjusted for unequal interval lengths.

Next, looking at those measures with Age main effects (see Table 4), older age students show a fairly

constant advantage on these measures at each time of assessment. There is no statistically significant

interaction between Age and Time for scores yielded by the exceptions scoring criterion despite the

apparent convergence of the means at time three for the two age groups.

The quadratic effect of Time for the Test of Thematic Analysis summated score (5 of 9) is reflected in a

drop on this measure at time three (see Table 4), which is confirmed by a posteriori analyses. There is

also a linear decrease with Time (see Table 2). The examples scoring criterion, which is a component of

this summated measure, likewise shows a linear (see Table 2) decrease with Time (see Table 4). In

contrast, the exceptions scoring criterion shows a linear (see Table 2) increase with Time (see Table 4).

Thus, by exploring the scoring criterion components of the Test of Thematic Analysis, we have found

trends in opposite directions.


The Inference subscale of the Critical Thinking Appraisal shows a linear (see Table 2) increase with Time

(see Table 4). Like the linear increases with Time for the Deduction subscale (see Table 3), these

increases occur generally across our two age cohorts. The Recognition of Assumptions subscale shows

both a linear and quadratic effect (see Table 2). Examination of Table 4 shows an increase on the

Recognition of Assumptions subscale at the second interval, which is confirmed by a posteriori analyses.

Relation of Measures to Curricular Progress

Are the critical thinking and exploratory single-item measures that have shown change as a function of time of assessment also sensitive to curricular impact? In preparation for examining the relation between

curriculum progress and performance on the critical thinking measures, we tabulated the measures of

curriculum progress according to two separate cumulative subtotals: one subtotal for the first two years

since enrollment and the other subtotal for the subsequent year and a half. The one curriculum progress

subtotal corresponds to the interval of time between the first and second fieldings of the critical

thinking measures, and the other curriculum progress subtotal corresponds to the interval of time

between the second and third fieldings of the critical thinking measures.

We reasoned that students showing a high degree of curriculum progress in the intervals of the study

would at the end of the intervals show higher performance on the critical thinking measures. Thus,

progress in the curriculum up to the second assessment was correlated with the critical thinking

measures at the second assessment, and progress in the curriculum between the second and third

assessments was correlated with the critical thinking measures at the third assessment.
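The correlational design just described pairs each assessment with the progress accumulated in the interval leading up to it. A minimal sketch with hypothetical competence-level-unit counts and scores (illustrative values only, not the study data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
n = 190

# Hypothetical curriculum-progress subtotals for the two intervals
clu_through_time2 = rng.poisson(30, n).astype(float)       # first two years
clu_time2_to_time3 = rng.poisson(20, n).astype(float)      # subsequent year and a half

# Hypothetical critical thinking scores at the second and third assessments
score_time2 = 10 + 0.05 * clu_through_time2 + rng.normal(0, 2, n)
score_time3 = 10 + 0.05 * clu_time2_to_time3 + rng.normal(0, 2, n)

r2, p2 = pearsonr(clu_through_time2, score_time2)
r3, p3 = pearsonr(clu_time2_to_time3, score_time3)
print(f"Time 2: r = {r2:.2f} (p = {p2:.3f});  Time 3: r = {r3:.2f} (p = {p3:.3f})")
```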

The measures of curriculum progress employed are conceptually well suited for showing the kind of

student development relevant to growth of critical thinking. The curriculum of Alverno College is

abilities-based; on over 100 assessments, Alverno students demonstrate to criteria each of eight broad abilities set by the faculty (e.g., communication, analysis, problem solving, and valuing). These

assessments lead to the credentialing of the student on the sequential and developmentally arranged

“competence level units.” The number of competence level units completed and the number of course

credits completed were employed as two separate measures of progress in the curriculum.

Examination of Table 5 shows that the Test of Cognitive Development scores at time two are positively

correlated with cumulative progress in the curriculum. At time three, the Test of Cognitive Development

scores do not correlate with the intervening progress in the curriculum (see Table 5).

The Test of Thematic Analysis showed a somewhat complex relation to progress in the curriculum. The

Test of Thematic Analysis summated scale (5 of 9 criteria) was negatively associated with curriculum

progress at both the second and third assessments.

While the examples scoring criterion performs similarly to the summated scale, of which it is a

component, the exceptions scoring criterion yields different results. Thus, the examples criterion is

negatively associated with curriculum progress at the second assessment, and, for at least one of the


measures of curriculum progress, is also negatively associated with curriculum progress at the third

assessment (see Table 5). Again in contrast, the exceptions criterion shows a small positive relation to

progress in the curriculum at the third assessment (see Table 5).

For the three Critical Thinking Appraisal subscales only one of the twelve correlations with progress in

the curriculum is statistically significant (see Table 5). The Test of Cognitive Development and the Test of

Thematic Analysis and its single item subcomponents were related to progress in the curriculum, but the

Critical Thinking Appraisal measures generally were not. Why not?

Level of Thinking, Mastery of That Level, and Habituality of Use

We believe the lack of relationship of the Critical Thinking Appraisal with the measures of curriculum

progress flows from the multiple choice format of the Critical Thinking Appraisal. Before we clarify why

we suspect this, we need to distinguish between some broad performance characteristics. From a

developmental perspective we might describe one performance as demonstrating a higher level of

thinking than another. Higher levels of thinking are perhaps more complex, broader in scope, and so on.

Also, a sophisticated thought process could be based upon the coordination of several thought

processes, and might be predicated upon the development of other thought processes.

We can conceptually distinguish the sophistication or level of a thought process from other performance

characteristics. An individual can be said to have mastered a particular thinking process when they can demonstrate it with some consistency whenever it is explicitly requested. But,

even if an individual has mastered a level or kind of thinking, they might not habitually employ it. For

example, they may not be called on to use it, they may have opposing tendencies, or they may not know

it is expected of them. The distinctions between sophistication of a thinking process, mastery of a

thinking process, and habitual use of a thinking process may help explain why some critical thinking

measures that showed development during the college years also showed a relationship to curriculum

progress, while others did not.

Because of the recognition character of the Critical Thinking Appraisal, we do not feel confident in it as a

measure of well-practiced mastery. College students can probably recognize higher levels of thinking

than they can systematically produce. Performance on the Critical Thinking Appraisal may have more to

do with the breadth of the exposure to the thinking processes the test taps, than with a well-practiced

mastery of the thinking processes. Thus, although we are able to show gains on all of the Critical

Thinking Appraisal subscales, these robust gains may indicate that it is relatively easy to show progress

on these recognition based measures.

Even those students progressing slowly in the curriculum may have a breadth of exposure to the

thinking processes required by the Critical Thinking Appraisal. Indeed, the breadth of exposure that

slower progressing students have compared to that of faster progressing students may be of equal value

in recognizing the “correct” answers on the Critical Thinking Appraisal. This assumption would help


account for the finding that rate of progression in the curriculum is not generally related to subsequent

performance on the Critical Thinking Appraisal subscales.

Nonetheless, we must note that the Recognition of Assumptions subscale gains were most strongly

associated with the second interval in the study. This suggests that a sophisticated critical thinking

process was being measured. And, theoretically, the sophistication required to recognize assumptions of

arguments does seem high enough to account for delayed development in the curriculum. If a later

curriculum exposure is responsible for the delayed development on the Recognition of Assumptions

subscale, it would seem that this argues for expecting to find a relationship with curriculum progress.

But could it be that both the slower and faster progressing students were far enough along in the

curriculum to be exposed sufficiently to this thinking process as it was required by the recognition test?

We think so, and can note that the variance in the rate of progression of our students was not

inordinately large.

In explaining the relationship of the Test of Thematic Analysis to progress in the curriculum, we

conversely focus on its characteristic of being a production measure. We believe these curriculum-linked

changes on the Test of Thematic Analysis and its single item criterion-scores are consistent with an

interpretation that focuses on the assumed relationship between production measures and the

measurement of either broadly based mastery or habituality. More specifically, we suspect that mastery

of a way of thinking or habituality in a way of thinking develops incrementally and would not be

sensitive to minimal curriculum exposure. As a result, a range of progress in the curriculum would be

likely to be associated with differential performance on a production measure, which is what we found.

The response format of the Test of Cognitive Development is not purely either recognition or

production, and so it is inappropriate to speculate on how the results for this measure reflect upon our

thesis that production measures capture more habitual tendencies or more well-practiced abilities. We

do, however, gain greater confidence in this measure as a measure of college outcomes because of its

relation to progress in the curriculum. Furthermore, the historical development of this measure in the

research on formal operations does suggest the instrument is a measure of sophisticated and well-practiced thinking.

Inter-Relation of Critical Thinking Measures/Background Variables

More generally, we find other evidence that these various instruments are measuring different aspects

of higher-order thinking. First, the Test of Thematic Analysis (the purest production measure) is related

to Age, but the Test of Cognitive Development is not. Second, the Test of Cognitive Development and

the Critical Thinking Appraisal subscales are related to high school GPA at entrance, but the Test of

Thematic Analysis is not. Finally, although these measures are correlated with one another positively

(see Table 6), previously reported factor analyses (see Mentkowski & Strait, 1983) support distinctions

between the summated measures of critical thinking for the first two assessments.


Comparison with Cross-Sectional Results

We are compelled to offer another word of caution. Our own cross-sectional analyses would have led to

different conclusions than our longitudinal analyses. The cross-sectional analyses, which controlled for

attrition, showed a “gain” only on the “Defense” score for the Analysis of Argument instrument, F(1, 125) = 7.26, p < .01, and the Inference subscale of the Critical Thinking Appraisal, F(1, 127) = 12.75, p < .001.

Moreover, because the cross-sectional comparison involves the comparison of existing groups (which

differed on high school GPA, for example), even these “gains” would have been potentially spurious.

When we did control statistically for high school GPA in our cross-sectional comparison, the cross-

sectional Inference “gains” were no longer statistically significant. We place greater faith in our

longitudinal changes, of course. We point up this discrepancy between the longitudinal and cross-

sectional results, because we believe it is an object lesson in the problems of interpretation from cross-

sectional findings, even though each method has its place.
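The statistical control for high school GPA amounts to adding the covariate to the cross-sectional group comparison; a hypothetical sketch using statsmodels is shown below (the group labels, GPA values, and "inference" scores are all invented for illustration).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_entering, n_grads = 70, 60

# Hypothetical cross-sectional data: entering cohort vs. graduating class
df = pd.DataFrame({
    "group": ["entering"] * n_entering + ["graduating"] * n_grads,
    "hs_gpa": np.r_[rng.normal(2.8, 0.5, n_entering), rng.normal(3.1, 0.5, n_grads)],
})
df["inference"] = 7 + 1.5 * df["hs_gpa"] + rng.normal(0, 2, len(df))

# Unadjusted group difference, then the difference after controlling for HS GPA
unadjusted = smf.ols("inference ~ group", data=df).fit()
adjusted = smf.ols("inference ~ group + hs_gpa", data=df).fit()
print(unadjusted.summary().tables[1])
print(adjusted.summary().tables[1])
```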

Interpreting the New Production Measures

The task of more specifically interpreting changes on the Test of Thematic Analysis remains. We believe the

decline on the Test of Thematic Analysis summated measure (5 of 9 criteria) can be attributed to the

decline on the examples criterion. We interpret the tendency of students to give fewer examples, but to

make more exceptions, as a positive outcome. The increasing tendency of students to make exceptions

in their analysis of the essays suggests that the thinking of the students may be becoming more abstract.

The decreasing tendency of the students to give examples of their abstractions seems somewhat

puzzling at first. In order to explain this, we make the following observations.

Faculty at Alverno College are pioneers in the writing across the curriculum movement. They explicitly

teach students to write for a particular audience, and they have developed standards of performance on

this dimension. Since the format of the Test of Thematic Analysis suggests that researchers are the

audience, we imagine the students increasingly reasoning as follows as they repeatedly encounter the Test of

Thematic Analysis: “These researchers assuredly are aware of what the four sentence essays say. I don’t

need to give examples from these essays to make my points, because they already know the examples.

They would be bored by the obviousness of the examples, which would be about as long as the set of

essays.”

We believe styles of writing developed across school programs, as well as across disciplines, may vary.

Thus, we encourage a complex interpretation of the decline and gain on the Test of Thematic Analysis

scores. Future modifications of the instrument should consider, at a minimum, providing directions

which identify a more explicit audience.

Although the findings for the Analysis of Argument instrument cast some doubt upon its suitability for

use at this college as an outcomes measure, we believe modifications in the instrument may overcome

these difficulties.


In regard to the instructions for the “Defense” essay, if they advised against role playing, this might

improve the “Defense” essay’s measurement properties. In effect, students are penalized for role

playing “unsophisticated” thinking, and for empathizing with the perspective of the essay. In regard to

the “Attack” essay’s instructions, we note that they do not specifically request students to analyze the

adequacy of the arguments in the stimulus essay. Instead the instructions ask the students to “argue

against” the stimulus essay. The stimulus essay was an emotional one-sided essay that presented an

unsupported position likely to be counter to the attitudes of the students. In their argument against this

stimulus essay, the students in the present study predominantly tended to present the opposing

position, which would be their own opinion, as opposed to attacking the logic of the essay. The

emotional one-sided stimulus essay may have elicited an emotional need to present the opposing

position.

Thus, with respect to the lack of change on the Analysis of Argument “Attack” scores, one might

speculate that these students have not specifically and habitually learned to give precedence to a logical

analysis of the flaws in their opponent’s arguments; instead, their predominant tendency may have

been to give a positive statement of the opposite position. If the instructions had specifically requested

an analysis of the merits and flaws of the stimulus essay, we may have found developmental differences

on this measure. We are proposing a distinction here between the ability to write, when requested, an

intellectual critique in an emotional environment versus the habitual tendency to write a purely

intellectual critique in an emotional environment.

Implications for Standardization of Measures

We note that Winter, McClelland, and Stewart (1981) reported both longitudinal and cross-sectional

gains for the Analysis of Argument and Test of Thematic Analysis at “Ivy” College. These findings

contrast somewhat with our findings at Alverno College, even though we used the same scoring

procedures. We tend to discount the cross-sectional gains for Analysis of Argument we found because of

our lack of longitudinal results, while we feel encouraged by our perhaps unique bi-directional

longitudinal changes on the single criterion-scores of the Test of Thematic Analysis. Although we would

like to have been able to demonstrate the robustness of the standard scoring and administration

procedures developed at “Ivy” college, we have not been able to show cross-college equivalence for

these measures.

The likelihood that different educational strategies are used in different colleges has wide implications

for using scoring schemes that are assumed to be standard, and requires further consideration. For

instance, colleges may differ in how writing is taught and in how often role playing is pedagogically used,

and these and other stylistic differences in instruction may cause students to interpret differently the

instructions on these production measures. This confounding may greatly complicate the cross-

institutional measurement properties of these measures. Not only may differing educational contexts

elicit different interpretations of how to perform well on an instrument, but also these different

interpretations may be equally reasonable if the instrument itself does not communicate to the student

the preferred type of performance.


It is also possible that these differing educational strategies reflect differing conceptions of critical

thinking outcomes. For example, at Alverno, role playing is often explicitly used to encourage students

to take tolerantly the perspective of another person or culture. Students are encouraged to consider the

cultural or personal assumptions they make. In contrast, the scoring procedures standardized by “Ivy”

college for the Analysis of Argument “Defense” essay are based upon another conception. The ideal student portrayed in these scoring criteria will insist upon maintaining a position consistent with her

own, even as she obligingly defends the good points present in an opponent’s position. There is no

necessary recognition of the cultural supports for one’s own beliefs or of the perhaps equal validity of

another’s views. Another example of possibly different conceptions of critical thinking can be illustrated

with the scoring of the “Attack” essay. The “Ivy” college scoring procedures reflect a preference for a

logical analysis of the stimulus essay. The present college, however, also encourages students to express

their own values and positions as well as encouraging logical analysis.

If either the conception of critical thinking differs across colleges, or, else, the students’ interpretation

based upon their educational context differs, even the direction of “gain” or “decline” on a criterion may

also differ. This concern suggests that a single college population—however reputable—may not serve

well as the major source for standardizing scoring criteria.

In many instances, it would be desirable to try to insure that these production measures elicit similar

interpretations of the task by students from different colleges. Here a distinction needs to be made

between the respondent’s ability to conform to a standard versus the respondent’s predominant

tendencies in unstructured situations. We do not rule out the development of several standardized

scoring systems, or even the particularization of scoring systems to a college’s conception of critical

thinking. What we earnestly recommend for production measures of ability, however, is that they offer

to the respondent a clear idea of the standards that will be applied to their productions, if the

researcher intends to represent their performance as their ability to conform to a standard.

A more unstructured stimulus may be used when the researcher is interested in the predominant

tendencies of the respondents. In such cases, the researcher needs to be especially sensitive to the

distinction between necessary habits supporting an ability and stylistic divergence. We feel that the

Analysis of Argument and Test of Thematic Analysis do elicit habitual tendencies, but are not yet

sensitive to stylistic divergence. Their instructions draw forth the respondent’s habitual tendencies, but

the standardized scoring reflects a preferred response that does not accommodate alternative modes

and styles of critical thinking.

In general, we recommend fielding both types of production measures, those measuring the ability to

demonstrate mastery to standards and those measuring habitual tendencies. We also recommend

keeping these types of measures separate and distinct. Otherwise, it may not be possible to unconfound

them.


Educational/Scientific Importance of the Study

Researchers are encouraged by these results to continue development of measures which ask students

to show the process of their thinking in addition to showing comprehension of knowledge, concepts or

generalizations. Critical thinking appears best understood as composed of several dimensions, which at

this college develop differently across the college years. Researchers developing production measures

need to distinguish in their measurement between habitual tendencies versus the ability to show

mastery to a standard when requested. At present, the Analysis of Argument and Test of Thematic

Analysis tend to be similar to projective measures in that they do not specifically guide responses and,

as a result, tend to measure habitual tendencies. Our results suggest that these instruments need

further refinement if they are to be used as standardized cross-college outcome measures. We were not

able to show their scoring criteria to be internally reliable.

We remain confident in the potential usefulness of the Test of Thematic Analysis and Analysis of

Argument to a wide variety of colleges once they have undergone further development to either tailor

them to the institution’s own educational goals and definitions, or to improve their generic cross-

institutional equivalence. Even at this early stage of development, we have found the findings from the

instruments useful in suggesting the habitual tendencies of our students. The Test of Thematic Analysis

appeared sensitive to changes in their habits of thought.

We urge educators to support researchers developing production measures and not to rely entirely on

traditional, though efficient, measures that may not work well as measures of well-practiced mastery.

Production measures seem better suited to the measurement of the kind of mastery obtained by the

consistent practice of higher-order thought and of the kind of habituality of thought processes that

would lead to their use across situations. We have suggested that the Recognition of Assumptions

subscale may tap a relatively sophisticated thought process, but we also note that some sophisticated

thought processes, for example, as used in the integration of perspectives, may require a production

measure. We encourage researchers to broaden their scoring schemes to include the kind of critical

thinking practiced by adults already in the working world (Arlin, 1975; McClelland, 1973).

Educators should be encouraged by our conclusion that critical thinking develops in college. The Alverno

students not only showed gains on the Piagetian-based Test of Cognitive Development, but also their

performance on this measure has been linked with their prior progress in the curriculum (cf.

Mentkowski & Strait, 1983). These students also showed gains on each of the three subscales of the

Critical Thinking Appraisal that we fielded. This finding is tempered somewhat by our inability to show a

direct relationship between performance on this measure and progress in the curriculum. Performance

on the Test of Thematic Analysis did appear to be linked to progress in the curriculum. In this regard, the

Alverno students may have developed the habit of making more exceptions or qualifications to their

analyses. Although they also may have weakened in their habit of giving examples to their analyses, we

suspect this finding is context bound.


So far, we are able to support the usefulness of only two of three new measures of critical thinking. We

are cautioned in interpreting these study results. For example, although this study generally confirmed

the usefulness of the Test of Thematic Analysis, a comparative cross-sectional study of seven colleges

was able to demonstrate change on the Test of Thematic Analysis for only two of the seven: “Ivy”

College, a famous and highly selective one, and Alverno College (see Winter, McClelland, and Stewart,

1981). Why? We have already noted that the instrument criteria and instructions were standardized by

a single institution, “Ivy” College. We also note that the findings for Alverno reported in that study, which

were based upon a cross-sectional comparison of entering versus graduating students, have not been

confirmed by our longitudinal results. But, even if the summated results from “Ivy” college and the

results from the present analysis of the scoring criteria for the Test of Thematic Analysis are taken as

suggestive of possible differences, we have yet to demonstrate robust cross-college findings. We note

that at Alverno College, faculty have identified critical thinking abilities as college outcomes and

developed teaching strategies and instruments to teach and assess them. Perhaps, colleges cannot

expect change on production measures without such an explicit curriculum or without highly selective

student bodies. If so, it means that researchers and educators must work together at both instrument

and curriculum development at a range of colleges.


REFERENCES

Arlin, P. (1975). Cognitive development in adulthood: A fifth stage? Developmental Psychology, 11(5), 602-606.

Ewell, P. (1984). The self-regarding institution: Information for excellence. Boulder, CO: National Center for Higher

Education Management Systems.

Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. Paper presented at a

conference on Teaching Thinking Skills, Wingspread Conference Center, Racine, WI.

Lipman, M. (1984). Philosophy and the cultivation of reasoning. Paper presented at a conference on Teaching

Thinking Skills, Wingspread Conference Center, Racine, WI.

Marchese, T. (1985). Learning about assessment. AAHE Bulletin, 38(1), 10-13.

McClelland, D. (1973). Testing for competence rather than for “intelligence.” American Psychologist, 28, 1-14.

Mentkowski, M., & Doherty, A. (1984). Abilities that last a lifetime: Outcomes of the Alverno experience. AAHE

Bulletin, 36(6), 1-6, 11-14.

Mentkowski, M., & Doherty, A. (1983, revised 1984). Careering after college: Establishing the validity of abilities

learned in college for later careering and professional performance. Final report to the National Institute of

Education: Overview and summary. Milwaukee, WI: Alverno Productions.

Mentkowski, M., & Strait, M. (1983). A longitudinal study of student change in cognitive development, learning

styles, and generic abilities in an outcome-centered liberal arts curriculum. Final report to the National

Institute of Education, research report number six. Milwaukee, WI: Alverno Productions.

Nickerson, R. (1984). Teaching thinking: What is being done and with what results? Cambridge, MA: Bolt Beranek

and Newman, Inc.

Paul, R. (1984). The concept of critical thinking: An analysis, a global strategy, and plea for emancipatory reason.

Rohnert Park, CA: Sonoma State University Center for Critical Thinking and Moral Critique.

Renner, J., Fuller, R., Lockhead, J., Johns, J., Tomlinson-Keasey, C. & Campbell, T. (1976). Test of Cognitive

Development. Norman, OK: University of Oklahoma.

Sigel, I. (1984). Reflection on thinking about thinking: The educational discovery of the 80’s? Paper for presentation

at a conference on Teaching Thinking Skills, Wingspread Conference Center, Racine, WI.

Sternberg, R. (1983). How can we teach intelligence? Philadelphia, PA: Research for Better Schools, Inc.

Stewart, A. (1977a). Analysis of argument: An empirically-derived measure of intellectual flexibility. Boston:

McBer and Company.

Stewart, A. (1977b). Scoring manual for stages of psychological adaptation to the environment. Unpublished

manuscript, Department of Psychology, Boston University.

Watson, G., & Glaser, E. (1964). Critical Thinking Appraisal. New York: Harcourt, Brace, Jovanovich.


Winter, D. (1976). The Test of Thematic Analysis. Boston: McBer and Company.

Winter, D., & McClelland, D. (1978). Thematic analysis: An empirically derived measure of the effects of liberal arts

education. Journal of Educational Psychology, 70, 8-16.

Winter, D., McClelland, D., & Stewart, A. (1981). A new case for the liberal arts: Assessing institutional goals and

student development. San Francisco: Jossey-Bass.


Table 1: Inter-Item Reliability, Cronbach's Alpha, for Each Time of Assessment¹

Measure                                       Time 1   Time 2   Time 3
Test of Cognitive Development                   NA       .47²     .54
Test of Thematic Analysis (all 9 criteria)      .28      .11      .19
Test of Thematic Analysis (5 of 9 criteria)     .36      .17      .35
Analysis of Argument Attack items               .33      .17      .09
Analysis of Argument Defense items              .79      .46      .75

¹ For the reliability analyses, each scoring criterion was used as an item. For the Test of Cognitive Development, 5 scores were used as items. For the Analysis of Argument “Attack” essay, 5 scoring criteria were also available. For the Analysis of Argument “Defense” essay, 4 criteria were coded, but only 3 were used in the reliability analysis because of multicollinearity.
NA The reliability coefficient is not available because only total scores were keypunched.
² At time 2, the Test of Cognitive Development reliability coefficient is based only on the 1977 cohort, because only the total score for the 1976 cohort was keypunched.


Table 2: Age by Time Repeated Measures ANOVAs for Linear and Quadratic Contrasts¹

Measures                                            Age Main Effect       Linear Time Main Effect   Quadratic Time Main Effect
Test of Cognitive Development                       F(1,189) < 1          F(1,189) = 14.7***        F(1,189) = 1.7
Test of Thematic Analysis (5 of 9 criteria)         F(1,192) = 22.4***    F(1,192) = 3.9*           F(1,192) = 5.1*
Test of Thematic Analysis, exception                F(1,192) = 3.6        F(1,192) = 14.0***        F(1,192) < 1
Test of Thematic Analysis, example                  F(1,192) = 16.4***    F(1,192) = 16.6***        F(1,192) < 1
Inference Subscale of Critical Thinking Appraisal   F(1,180) = 6.6*       F(1,180) = 19.1***        F(1,180) < 1
Recognition of Assumptions Subscale of
Critical Thinking Appraisal                         F(1,180) = 4.9*       F(1,180) = 4.1*           F(1,180) = 6.8*
Deduction Subscale of Critical Thinking Appraisal   F(1,179) < 1          F(1,179) = 19.6***        F(1,179) < 1

¹ Multivariate analyses of variance testing the combined effects of linear and quadratic time were statistically significant for all reported main effects. Both linear and quadratic tests of the Age by Time interaction failed to reach statistical significance (all F values less than 1).
* p < .05
** p < .01
*** p < .001


Table 3: Means for Linear Time Main Effect Ignoring Age

Measures Time 1 Time 2 Time 3

Test of Cognitive Development 11.45 12.24 12.37

Deduction Subscale of Critical Thinking Appraisal 16.10 16.64 17.16


Table 4: Means for Time of Assessment Broken Down for the Age Main Effect

Measure                                              Time 1   Time 2   Time 3
Test of Thematic Analysis (5 of 9 criteria)
  Age 17 to 19                                        1.09     1.19      .91
  Age 20 to 55                                        1.56     1.57     1.32
Test of Thematic Analysis, exception
  Age 17 to 19                                         .25      .36      .48
  Age 20 to 55                                         .37      .48      .49
Test of Thematic Analysis, example
  Age 17 to 19                                         .32      .22      .16
  Age 20 to 55                                         .50      .41      .29
Inference Subscale of Critical Thinking Appraisal
  Age 17 to 19                                        8.97     9.56     9.58
  Age 20 to 55                                        9.87    10.45    10.93
Recognition of Assumptions Subscale of Appraisal
  Age 17 to 19                                       10.96    10.52    11.18
  Age 20 to 55                                       11.26    11.35    12.01


Table 5: Correlation of Critical Thinking Measures with Progress in the Curriculum on “Competence Level Units” (CLU's)¹ and on Credits Achieved

Measures                                              Time 2    Time 2     Time 3    Time 3
                                                      CLU's     Credits    CLU's     Credits
Test of Cognitive Development                          .21**     .15*       .03       .08
Test of Thematic Analysis (5 of 9 criteria)           –.20**    –.31***    –.17**    –.29***
Test of Thematic Analysis, exception                  –.04      –.09        .10       .13*
Test of Thematic Analysis, example                    –.21**    –.29***    –.10      –.25***
Inference Subscale of Critical Thinking Appraisal      .05      –.07       –.08      –.15*
Recognition of Assumptions Subscale of
Critical Thinking Appraisal                            .09      –.03       –.11      –.07
Deduction Subscale of Critical Thinking Appraisal      .04       .06       –.11      –.02

¹ Alverno students demonstrate to criteria on over 100 assessments each of 8 broad abilities, which have been set by the faculty (e.g., communication, analysis, problem solving, etc.). It is these assessments that lead to the credentialing of the student on the sequentially and developmentally arranged “competence level units.”
* p < .05
** p < .01
*** p < .001


Table 6: Correlations between the Critical Thinking Measures at Time One

Measures                                              Test of        CTA          Recognition of   CTA
                                                      Cognitive      Inference    Assumptions      Deduction
                                                      Development    Subscale     Subscale         Subscale
Test of Cognitive Development                                         .28***       .21**            .35***
Test of Thematic Analysis (5 of 9 criteria)            .38***         .20**        .16*             .29***
Test of Thematic Analysis, exception                  –.05            .10          .17**            .06
Test of Thematic Analysis, example                     .24**          .08         –.01              .17*
Inference Subscale of Critical Thinking Appraisal                                  .29***           .29***
Recognition of Assumptions Subscale of
Critical Thinking Appraisal                                                                         .38***
Deduction Subscale of Critical Thinking Appraisal

* p < .05
** p < .01
*** p < .001


Appendix A: Test of Thematic Analysis Criteria Included Versus Not Included

Criteria Included in 5 of 9 Total Score

Making “Direct Compound Comparisons” positively scored

Giving “Examples” positively scored

Using an “Analytic Hierarchy” positively scored

“Redefinition” for Scope or Clarity positively scored

Comparing “Apples and Oranges” negatively scored

Criteria Excluded From 5 of 9 Total Score

Making “Exceptions” or “Qualifications” positively scored

“Subsuming Alternatives” positively scored

“Affect” negatively scored

“Subjective Reaction” negatively scored