Ensuring, Evaluating, & Documenting Comparability of AA-AAS Scores (Scott Marion, Center for Assessment, CCSSO, June 22, 2010)

TRANSCRIPT

Page 1

Scott Marion, Center for Assessment

CCSSO, June 22, 2010

Ensuring, Evaluating, & Documenting Comparability of AA-AAS Scores

Page 2

What is comparability?

In an assessment context, comparability means that the inferences from the scores on one test can be psychometrically related to a score on another “comparable” test. In other words, we could consider the scores from two comparable tests interchangeable.


Page 3

Why do we care about comparability?

In fully individualized assessments, we don’t, BUT we need scores to be comparable when:

- judging two or more scores against a common standard,
- aggregating scores to the school or district level (we are assuming that scores are comparable), and
- judging scores for the same students and/or the same schools across years.


Page 4

Comparability and Flexibility

Flexibility or individualization can pose challenges to comparability.

Using the same items and the same (extended) content standards each year would appear to ameliorate any comparability concerns. But not everything is as it appears: issues with “teaching to the test” threaten comparability.

Obviously, completely individualized tasks addressing a non-systematic selection of standards raise considerable comparability concerns.


Page 5

Traditional Methods

Scaling is simply placing raw scores on a numerical scale intended to reflect a continuum of achievement or ability so that similar scale scores have similar meaning across tests (Petersen, Kolen, & Hoover, 1989).

Linking describes a family of approaches (including equating) by which we can place the scores from one assessment on the same SCALE as another assessment (e.g., putting the 2006 scores on the 2005 scale).
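
To make the distinction concrete, here is a minimal sketch of linear scaling in Python, assuming an invented vector of raw scores and an arbitrary reporting scale (mean 500, SD 50); it illustrates the idea only and is not taken from the presentation.

```python
import numpy as np

# Hypothetical raw scores from a single administration (invented values).
raw = np.array([12, 18, 25, 31, 36, 40, 44, 47], dtype=float)

# Linear scaling: place raw scores on a reporting scale with a chosen
# mean and standard deviation (500 and 50 here are arbitrary choices).
target_mean, target_sd = 500.0, 50.0
scale_scores = target_mean + target_sd * (raw - raw.mean()) / raw.std()

print(np.round(scale_scores, 1))
```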


Page 6

Scaling Requirements

We can create scales from many different types of raw scores, but for the scale to lead to valid inferences, the original raw scores must have a similar conceptual foundation (i.e., the raw scores should be derived from similar assessment experiences, unless we move to a normative approach).


Page 7

Linking (Equating) Requirements

There is a fairly extensive literature regarding the requirements for valid equating. Depending on content and contextual relationships between the two tests, the linking could be as strong as formal equating.

If equating assumptions are not met, calibration, projection, or even judgmental methods could be applied to connect the two sets of test scores. It is challenging for AA-AAS to meet many of the assumptions necessary for strict equating.


Page 8

Mislevy on Linking (1992)

In Linking Educational Assessments (1992), Mislevy states, “The central problems related to linking two or more assessments are: discerning the relationships among the evidence the assessments provide about conjectures of interest, and figuring out how to interpret this evidence correctly” (p. 21).

A brief summary of the three most valid approaches to linking (in descending order of quality) follows.


Page 9

Equating (Mislevy, 1992, p. 21)

The linking is strongest (and simplest) if the two tests were designed from the same test blueprint and were designed to measure the same construct(s). The most common example is two or more forms of the same test. “Under these carefully controlled circumstances, the weight and nature of the evidence the two assessments provide about a broad array of conjectures is practically identical”.

It is a statistical method of relating the scores on one test to the scores on a second test in order to separate differences in item/test difficulty from changes in student achievement.
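
As a rough illustration of that idea, the sketch below applies mean-sigma linear equating to two invented score sets under an assumed random-groups design; the numbers and the design are assumptions, not data from the presentation.

```python
import numpy as np

# Hypothetical raw scores on two forms given to randomly equivalent groups
# (random-groups design); all values are invented for illustration.
form_x = np.array([22, 25, 28, 30, 33, 35, 38], dtype=float)  # new form
form_y = np.array([20, 24, 27, 29, 31, 34, 36], dtype=float)  # base form

# Mean-sigma linear equating: transform Form X scores so their mean and
# standard deviation match Form Y's, separating form difficulty from
# differences in student achievement.
slope = form_y.std() / form_x.std()
intercept = form_y.mean() - slope * form_x.mean()

def x_to_y(score):
    """Map a Form X raw score onto the Form Y scale."""
    return slope * score + intercept

print(round(x_to_y(30.0), 2))
```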


Page 10

Calibration (Mislevy, 1992, p. 24).

If the two (or more) tests were not designed from the same test blueprint, but both have been constructed to provide evidence about the same type of achievement, then the scores can be related through calibration. “Unlike equating, which matches tests to one another directly, calibration relates the results of different assessments to a common frame of reference, and thus to one another only indirectly” (Mislevy, 1992, p. 24).

There are several situations in which calibration is used. The two most common are: (1) constructing tests of differing lengths from essentially the same blueprint, and (2) using IRT to link responses to a set of items (e.g., item bank) built to measure the “same construct.”
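
One way to picture the item-bank case is a mean/mean Rasch linking constant computed from common (anchor) items; the sketch below uses invented difficulty estimates and is only a simplified illustration of calibrating to a common frame of reference, not a method described in the presentation.

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for anchor items that appear both
# in the item bank and on a newly calibrated form (values are invented).
bank_difficulties = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
form_difficulties = np.array([-1.0, -0.1, 0.3, 1.1, 1.7])

# Mean/mean linking: the shift that places the new form's calibration on
# the bank's frame of reference is the average anchor-item difference.
shift = (bank_difficulties - form_difficulties).mean()

# Apply the shift to every item on the new form (anchors shown here).
linked_difficulties = form_difficulties + shift
print(round(shift, 3), np.round(linked_difficulties, 3))
```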

Page 11

Projection (Mislevy, 1992, p. 24)

"If assessments are constructed around different types of tasks, administered under different conditions, or used for purposes that bear different implications for students' affect and motivation, then mechanically applying equating or calibration formulas can prove seriously misleading: X and Y do not 'measure the same thing.”

Mislevy's concern here is that the two assessments measure qualitatively different information. While it might make sense to administer the two tests to students, just because the two tests are correlated does not mean we should try to link them.
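
A projection is essentially a prediction, not an identity. The sketch below regresses invented scores on one assessment onto paired scores on another; the projected value carries prediction error and should not be read as an equated, interchangeable score.

```python
import numpy as np

# Hypothetical paired scores for students who took both assessments
# (invented values). Projection predicts Y from X via regression rather
# than treating the two tests as measures of the same thing.
x = np.array([10, 14, 17, 21, 25, 28, 33], dtype=float)
y = np.array([32, 35, 41, 44, 50, 51, 58], dtype=float)

# Ordinary least-squares line: y_hat = b0 + b1 * x.
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# Projected Y score for a student with X = 20 (a prediction, not equating).
print(round(b0 + b1 * 20.0, 1))
```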


Page 12

Current Operational Tests

Most current operational programs rely on statistical links to relate scores from one year to another, under the assumption that the connection is strong enough to conduct some form of equating.


Page 13

We Give the Same “Test” Every Year

Comparability is relatively easy, almost a non-issue, but what do you do when:

- you have to replace “tired” or poorly functioning items?
- you find out that there is score inflation due to “teaching to the test”?

Teaching to the test could become an issue; there is a long history of this with regular assessments. This would be a threat to valid comparability and accountability validity.


Page 14

Is the “same” really the “same”?

What if you introduce new items to your supposedly common form?

How many new items can your test absorb before you will feel the need for formal equating? Do not just think of this as a one-year issue: a little here, a little there, and pretty soon you have a new test.

Once you replace 5-10% of the items or so, you should think about formal linking/equating.
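
A back-of-the-envelope check makes the drift concrete; the item count and refresh rate below are assumptions chosen only to show how quickly the share of new items passes a 5-10% threshold.

```python
# Hypothetical test with 40 items, replacing 3 items per year.
total_items = 40
replaced_per_year = 3

original_remaining = total_items
for year in range(1, 5):
    original_remaining -= replaced_per_year
    pct_new = 100 * (total_items - original_remaining) / total_items
    print(f"Year {year}: {pct_new:.1f}% of the items are new")
```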


Page 15

Issues of Flexibility/Standardization

- Common or unique items
- Common or unique indicators (finer grain than standards)
- Common or variable forms
- Unique or common scoring


Page 16

But, we have considerable flexibility…

Flexible items/same standards:
- Alignment methods such as the Porter-Smithson approach (aligning to a common standard) might work; see the sketch after this list
- Other judgmental approaches

Flexible items/flexible standards:
- Judgmental methods only
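
For the flexible-items/same-standards case, a Porter-style alignment index can quantify how closely the content emphasis of the administered items matches the common standards. The sketch below is a minimal illustration with invented proportion matrices; it is not taken from the presentation.

```python
import numpy as np

# Invented proportions of content emphasis in a topic x cognitive-demand
# grid: one matrix for the administered items, one for the standards.
# Each matrix sums to 1.0.
items = np.array([[0.10, 0.05],
                  [0.25, 0.20],
                  [0.25, 0.15]])
standards = np.array([[0.15, 0.10],
                      [0.20, 0.20],
                      [0.20, 0.15]])

# Porter-style alignment index: 1 minus half the total absolute
# discrepancy, so 1.0 means identical content emphasis.
alignment = 1.0 - np.abs(items - standards).sum() / 2.0
print(round(alignment, 3))  # 0.9 for these invented matrices
```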


Page 17

A Few Ways to Establish “Comparability”

Establish construct comparability based on similar content

Establish comparability based on similar or compensatory functionality

Establish comparability based on judgments of relatedness or comparability


Page 18

Comparability based on similar content

Establish construct comparability based on similar content – for example, one assessment item taps the same construct as another assessment item. This may be based on a content and/or cognitive analysis

This approach needs to be documented and defended in terms of the process and the results


Page 19

Similar or compensatory use

Establish comparability based on similar or compensatory use – distributional requirements often specify which profiles of performance will be treated as comparable; total scores based on a compensatory system work similarly.

In other words if students perform similarly, as a group, on one set of items compared to another, they may be treated as comparable.

This is a much weaker connection than the content or cognitive analysis approach
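
The group-level idea can be sketched as a simple comparison of score distributions on two item sets; the scores below are invented, and passing such a check is a much weaker warrant than a content or cognitive analysis.

```python
import numpy as np

# Hypothetical total scores for the same group of students on two item
# sets (invented values). If the group-level distributions look similar,
# a compensatory system might treat the sets as comparable.
set_a = np.array([14, 18, 21, 25, 27, 30, 34, 36], dtype=float)
set_b = np.array([15, 17, 22, 24, 28, 31, 33, 37], dtype=float)

# Compare distributional summaries (quartiles here) rather than matching
# individual students item by item.
for q in (25, 50, 75):
    print(f"P{q}: set A = {np.percentile(set_a, q):.1f}, "
          f"set B = {np.percentile(set_b, q):.1f}")
```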


Page 20

Judgments of comparability

Establish comparability based on judgments of relatedness or comparability – disciplined judgments may be made to compare almost anything in terms of specified criteria (e.g., is this bottle as good a holder of liquid as this glass is?). Decision-support tools and a common universe of discourse undergird such judgments.

Obviously, this approach should be used only when neither of the other two approaches can be used.


Page 21

Performance categories

While the item-based approaches to comparability are the most appropriate, more holistic judgments can be made at the performance category level

In other words, is there evidence that two students designated as proficient, for example, have comparable academic knowledge and skills?


Page 22

Summary

AA-AAS pose significant challenges to comparability, which, in turn, poses challenges to the validity of score inferences across students and/or across years.

We cannot blindly employ statistical procedures and “pretend” to equate when we haven’t met many of the assumptions…

We must articulate a clear rationale for our approaches to comparability and document the methods and results as described in this presentation


Page 23

For more information

Scott Marion
www.nciea.org
