Technical Manual for the NWEA Measures of Academic Progress and Achievement Level Tests September 2003


DESCRIPTION

NWEA is the Northwest Evaluation Association, a non-profit that makes the adaptive online "benchmark" test known as MAP (Measures of Academic Progress). This product is in use in Seattle Public Schools. Some parents have concerns about this test being used in SPS and about the manner in which the product was selected.

TRANSCRIPT


Technical Manual for the NWEA Measures of Academic Progress and Achievement Level Tests

September 2003


Copyright © 2003 Northwest Evaluation Association, Portland, Oregon

All rights reserved. No part of this manual may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from NWEA.

NWEA
12909 SW 68th Parkway, Suite 400
Portland, Oregon 97223

Phone: (503) 624-1951
Fax: (503) 639-7873

Website: www.nwea.org

General Info: [email protected]


Table of Contents

Foreword .......... 1
Acknowledgments .......... 2
Introduction .......... 3
    Other Resources .......... 3
    Organization of This Manual .......... 4
    Covalent Assessments .......... 5
Educational Assessment .......... 6
    The Principles That Guide NWEA Assessments .......... 6
    MAP and ALT in a Comprehensive Assessment Plan .......... 9
Testing for a Purpose: NWEA Assessment Design .......... 9
    Test Structure .......... 9
        Achievement Level Tests (ALT) .......... 10
        Measures of Academic Progress (MAP) .......... 12
    Testing Modality .......... 15
The Item Banks and Measurement Scales .......... 17
    The Measurement Model .......... 17
    Scale Development and Item Calibration .......... 19
        Phase I: Developing the Measurement Scales .......... 19
        Phase II: Maintaining the Measurement Scales .......... 20
    Item Development .......... 21
        Item Writing and Editing .......... 21
        Item Bias Review .......... 22
        Content Category Assignment .......... 23
        Field Testing .......... 23
        Item Analyses and Calibration .......... 24
    Periodic Review of Item Performance .......... 27
The Assessment Process .......... 28
    Test Development .......... 29
        Test Design .......... 29
            ALT Test Specifications .......... 29
            MAP Test Specifications .......... 31
        Content Definition .......... 32
        Item Selection .......... 34
        Test Construction .......... 35
    Test Administration .......... 36
        The Testing Environment .......... 37
    Test Scoring and Score Validation .......... 40
        Achievement Level Tests (ALT) .......... 40
        Measures of Academic Progress (MAP) .......... 41
    Report Generation .......... 43
        Achievement Level Tests (ALT) .......... 43
        Measures of Academic Progress (MAP) .......... 43
    Results Interpretation .......... 44
Customizing Assessments .......... 46
    Localizing NWEA Assessments .......... 46
    Making Global Use of Localized Assessments .......... 47
Operational Characteristics of the Assessments .......... 49
    Validity .......... 51
    Reliability of Scores .......... 54
    Precision of Scores .......... 56
        Norm-Referenced Precision .......... 58
        Curriculum-Referenced Precision .......... 58
    A Final Note on MAP and ALT Operating Characteristics .......... 59
Appendix .......... 60
    Initial Scale Development .......... 60
    Item Bank Overview .......... 60
Glossary .......... 63
References .......... 64


Foreword

In the constant debate about the role of assessment in education, one question is central:

How do we efficiently and accurately measure how much students have achieved and how quickly they are learning?

NWEA was founded in 1976 by a group of school districts looking for practical answers to this question. Since then, NWEA has developed assessment tools that enable educational agencies to measure the achievement of virtually all their students with a great deal of accuracy in a short period of time. Two of these tools are the Measures of Academic Progress (MAP) and the Achievement Level Tests (ALT).

Both MAP and ALT are designed to deliver assessments matched to the capabilities of each individual student. In ALT, the assessment for a student is one of a set of pre-designed paper-and-pencil tests that vary in difficulty. The test that the student takes is based on past performance or a short locator test. In MAP, the assessment for a student is dynamically developed and administered on a computer. The test adjusts to match the performance of the student after each item is given.

This manual details the technical measurement characteristics of the ALT and MAP assessments. These include item development, test development, the nature of the measurement scales, and the appropriate use of scores from the assessments. Since many different tests are developed within MAP and ALT (to match local content and curriculum standards), this manual emphasizes those elements common to all of the tests developed in the two systems.

Since MAP and ALT deliver a variety of tests to a particular group of students, their measurement characteristics are somewhat different from those of a traditional single-form test. Throughout this manual, we detail the aspects of MAP and ALT that make them unique. We also highlight the areas in which traditional statistical test evaluation procedures need to be enhanced to tell the whole story of these unique, high-quality testing systems.

As with any technical manual, this one will not answer all questions. In fact, it is likely to spur as many new questions as it answers. For additional information, please consult the NWEA web site at www.nwea.org.

Allan Olson
G. Gage Kingsbury, Ph.D.

September 1, 2003


Acknowledgments

Organizations do not write documents, people do. Therefore it is imperative to thank the individuals who created this document and nurtured it from blank pages to its completed form. In this manual, many ideas have been drawn from nationally respected testing guidelines such as the APA/AERA/NCME standards, the Association of Test Publishers' guidelines, and the American Council on Education guidelines. The unique nature of the Measures of Academic Progress (MAP) and Achievement Level Tests (ALT) also required the development of additional processes for evaluating the quality of tests. This development depended on the contributions of the individuals mentioned below.

Many of the original ideas in this document came from Brian Bontempo of Mountain Measurement. Brian shaped the document and will hopefully see his vision in the final manual.

The manual has also enjoyed the thoughts, contributions, and review of the NWEA Research Team, including Carl Hauser, Ron Houser, and John Cronin. They provided much of the information used in the manual and consistently worked to maintain the integrity and accuracy of all information presented.

As with most NWEA documents, much of the inspiration and many of the best ideas and suggestions have come from individuals in our member agencies and board of directors. If this document is useful, it is due to these contributions.


Introduction

This manual provides decision makers and testing professionals the technical information necessary to understand the theoretical framework and design of the NWEA Measures of Academic Progress (MAP) and Achievement Level Tests (ALT). It provides a technical description of the psychometric characteristics and the educational measurement systems that are embedded in the MAP and ALT assessments. It is not designed to be a technical description or administration guide for the computer and information systems that develop and deploy the MAP and ALT assessments. Please refer to the appropriate administration documentation for this information.

This document is written for three primary audiences:

• Educators who are new to NWEA will find the document useful in evaluating the appropriateness of NWEA assessments for their setting.

• Educators currently using NWEA assessments will find the document useful in understanding and interpreting the results of the assessments.

• Measurement professionals will find information within this document to help evaluate the quality of NWEA assessments.

Other Resources

This manual is one element of a series of documents that pertain to the MAP and ALT assessments. To obtain additional information, consult the MAP administration training materials, the ALT Administration Guide, and the NWEA RIT Scale Norms document. These documents provide greater detail about how to administer the assessments and interpret the results. Taken together, they provide a comprehensive description of the MAP and ALT assessment systems.

Parents, teachers, test proctors, curriculum coordinators, assessment coordinators, and administrators should find the answers to many of their questions about the MAP and ALT assessments within the series of documents. Some readers may wish to seek additional information on various topics such as NWEA, educational testing, or psychometric theory. References to additional sources of information are provided throughout this manual.


Organization of This Manual

In an effort to provide useful information to all levels of practitioners, this document intertwines both basic and advanced information into each section:

• The Introduction includes this description of the manual and an introduction to the use of families of assessments that are bound together by shared measurement characteristics. These covalent assessments underlie NWEA assessments and much of modern testing practice.

• Educational Assessment describes the characteristics that a good educational assessment tool must have. It also discusses the psychometric principles considered in the development of NWEA assessments.

• Testing for a Purpose: NWEA Assessment Design introduces the measurement principles used in the design of the assessments. It also describes the design of ALT and MAP tests.

• The Item Banks and Measurement Scales details the development of the item banks and measurement scales underlying the MAP and ALT assessments. It includes discussions of the item development process and the field testing process. It concludes with the theory and application of psychometrics to integrate the items into the NWEA measurement scales. This section also includes information concerning Item Response Theory (IRT), the psychometric theory used in MAP and ALT.

• The Assessment Process describes the basic steps of the assessment development process and explains how these steps are conducted in a psychometrically sound manner. This section starts with the test development process and moves through test administration, scoring, reporting, and results interpretation.

• The section Customizing Assessments identifies the points in the development process at which an educational agency can customize the assessments to align with its local curriculum and match its assessment objectives. Since the MAP and ALT assessments are designed to be both locally appropriate and globally comparable, there is a science behind the localization process followed for all MAP and ALT development. This section covers that science and explains the factors that agencies must consider when localizing their assessments. This section also includes an explanation of how these localized assessments provide the capacity for global comparison.

• Operational Characteristics of the Assessments details the psychometric characteristics of the NWEA assessments. This section includes evidence from a variety of studies concerning the validity, reliability, and precision of the scores from MAP and ALT tests.


Covalent Assessments

One of the unique characteristics of the ALT and MAP assessments is that they are classes of assessments that share common features. These classes are termed covalent assessments. Covalent assessments are defined as classes of assessments for which test scores are interchangeable due to common item pools, measurement scales, and design characteristics.

NWEA has developed over 2,000 MAP tests and 900 ALT series that share these common characteristics. While the content specifications of the tests differ somewhat from one educational agency to another, the measurement characteristics are interchangeable. The scores from one district can be readily compared to the scores in another district.

One interesting aspect of the development of technical specifications for a class of assessments is that the outcomes tend to be more robust than those for a single test. The information in this manual is applicable to all of the covalent assessments. It is also applicable to tests that might be developed following a change in content standards or curricular focus. As a result, educational agencies using covalent tests never lose continuity of data as they improve their curriculum and instruction.

A commonly asked question concerning ALT and MAP tests is "How is it possible for the scores on these assessments to be interchangeable, since students take different questions in different school districts and even in the same classroom?" To answer this question, it is useful to describe some of the characteristics that are common to all NWEA assessments.

All of the ALT and MAP tests share these characteristics:

• Common item pools in which all items are pretested before use.

• Common cross-grade measurement scales for all items in a content area.

• Common test design, including numbers of items and distribution of difficulties.

• Common psychometric characteristics, or psychometric characteristics that differ by design.

These shared characteristics allow scores from different covalent tests to be directly comparable. This holds for students taking ALT and MAP tests across grades, across schools, across districts, and across test forms. A substantial amount of evidence concerning the stability of scores across different tests is included in the section Operational Characteristics of the Assessments.

This notion of interchangeable test items and interchangeable test scores is central to all the research concerning adaptive testing that has taken place over the past 30 years. The results of this research are now commonly accepted testing practice. Adaptive tests are now used in high-stakes testing settings such as medical licensure and certification, college entry, and armed services placement examinations. For more information about adaptive tests and the interchangeable use of different sets of questions in a variety of practical settings, see Drasgow and Olson-Buchanan (1999).


Educational Assessment

In a traditional assessment, all of the students in a single grade are given a single test form. This test is commonly designed with a wide range of item difficulties so that it is able to measure the achievement of most students to some extent. There are two common problems with this type of test: non-optimal psychological characteristics and low measurement accuracy.

Psychological characteristics. A traditional wide-range test includes items that span a wide difficulty range. Most students encounter a number of items that are too easy for them and a number of items that are too difficult for them. These items tend to either bore the students or frustrate them. As a result, students may not be measured well by a wide-range test because of their psychological reactions to the test.

Measurement characteristics. Because of the design of a wide-range test, a student encounters only a portion of the test that is challenging without being frustrating. This is the portion of the test that provides the most information about the student's achievement level. Many of the items in a wide-range test provide less information than they would if they were correctly targeted near the performance level of the student. For example, a student who is achieving at a level lower than his classmates may only see a few items in a wide-range test that he has the knowledge to answer. The answers to the other questions provide little information about the student's capabilities (except to indicate that the test was too difficult).

To improve the psychological and measurement characteristics of a test, we can design a test that increases the percentage of students who are challenged without being frustrated. This is the underlying motivation behind the development of the NWEA MAP and ALT tests. By improving the fit of the test to the student, we can obtain more information about the student without causing boredom or frustration.

The Principles That Guide NWEA Assessments

Educators need detailed information about individual student achievement. They use this information to place students in appropriate classes and to form instructional groups within a class. They also use it to help parents and other stakeholders understand the strengths of a particular student's educational development as well as the areas that need further instruction. When constructed and used properly, educational assessments are efficient tools that yield consistent, precise information concerning student achievement and growth.

Before discussing how NWEA assessments achieve these aims efficiently, it is useful to discuss why reliability, validity, precision, and efficiency are important. To do this, it is helpful to consider how a ruler is used to measure the length of a piece of wood. The principles and qualities of measurement in the physical world apply directly to measurement in education.

Reliability is a primary requirement of measurement. Reliability can be defined as the consistency of measures obtained from a measurement tool. For example, when the length of a small, straight piece of wood is measured with a ruler, one expects that additional measurements of the same piece of wood with the same ruler will have similar results. With repeated observations, you can determine that the ruler measurements are consistent and reliable.

Educational assessments that yield consistent results with repeated testing or within different portions of the same test are considered reliable. Those that provide inconsistent results are unreliable and of less use to educators. Therefore, one of the primary goals in creating an educational assessment is to create an assessment that has reliable scores.

Validity is another quality that is important in designing and evaluating assessments. Validity can be defined as the degree to which an educational assessment measures what it purports to measure. This attribute of a test and its scores is fundamental to the quality of the measurement.

While many aspects of validity have been discussed in the past, the two most critical include the degree to which:

• An assessment measures what is expected to be taught in the classrooms in which it is used.

• The scores from an assessment in a content area correspond to other indicators of student achievement in that same content area.

While the relationship of MAP and ALT scores to other indicators of student achievement is discussed in the section Operational Characteristics of the Assessments, the relationship of the assessments to what is expected to be taught requires further discussion here.

The extent to which a measurement is an accurate, complete indicator of the quality to be measured defines its validity as an expression of the content being assessed. In educational assessment, this is highly dependent upon the local curriculum. For example, a generic mathematics assessment may contain content that differs greatly from what is expected to be taught in the local classroom. On the other hand, a mathematics assessment designed to include content that is included in the local curriculum fits the content needs of the educational agency more completely. The latter, more valid assessment is clearly more valuable to educators in understanding student achievement.

Precision is a third important aspect of any assessment tool. Precision is the level of detail that the assessment can render. A ruler capable of measuring length to the nearest 1/16 of an inch is more precise than a ruler that measures to the nearest inch. Think of a ruler that has tick marks every 1/16 of an inch. The actual length of a measured piece of wood is somewhere between the two nearest tick marks of the ruler, for example, between 8 5/16 inches and 8 3/8 inches. With this ruler, the amount of error in any single measurement is less than 1/16 of an inch if the person taking the measurements is consistent.


In educational assessment, precision or error can be thought of in the same manner. It is common to refer to the amount of error to be expected in a test score as the Standard Error of Measurement (SEM). A more precise instrument (an instrument with a lower SEM) allows educators to see the difference between two similar students better than a less precise instrument does. It is for this reason that measurement precision is valuable to educators and is therefore one of the primary goals in designing an educational assessment.

Precision is affected by the amount of information obtained from each test question and by the number of questions on a test. Test makers can optimize the precision available for a test of a given length by developing a test that yields a high amount of information from each item.
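To make this relationship concrete, the short sketch below uses the Rasch (1PL) model that underlies the NWEA scales (introduced later in this manual, in The Item Banks and Measurement Scales). Under that model an item contributes information p(1 - p), where p is the probability of a correct response, and the SEM of a score is approximately 1 divided by the square root of the total information. The item difficulties here are hypothetical, and the code is only an illustration of the principle, not an NWEA scoring procedure.

    import math

    def rasch_p(theta, b):
        """Probability of a correct response under the Rasch (1PL) model (theta and b in logits)."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def test_sem(theta, difficulties):
        """Approximate SEM: 1 / sqrt(total information), where each item adds p * (1 - p)."""
        info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
        return 1.0 / math.sqrt(info)

    theta = 0.0                                    # a student's achievement level, in logits
    targeted = [-0.5, -0.25, 0.0, 0.25, 0.5]       # hypothetical items near the student's level
    off_target = [2.5, 3.0, 3.5, 4.0, 4.5]         # hypothetical items far too difficult

    print(round(test_sem(theta, targeted), 2))     # about 0.9: each item carries useful information
    print(round(test_sem(theta, off_target), 2))   # about 2.4: same test length, far less precision

The two five-item "tests" are the same length, but the well-targeted one yields a much smaller SEM, which is the sense in which targeting buys precision without adding questions.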

Efficiency is defined as the degree to which a measurement tool can be used to perform the measurement needed in the shortest possible period of time. If you use a ruler to measure the width of a piece of paper, it is a very efficient tool. If you try to use the ruler to measure the distance from Cleveland to Canton, it is an inefficient tool for the task.

The number of questions and the types of questions on a test directly relate to the time it takes students to complete the test. The precision of an assessment is also directly related to the number of questions on the assessment. The more questions there are on an assessment, the more precise the results are, but the more administration time the assessment requires.

The degree to which the difficulty of questions on a test is matched to a student's achievement level also directly affects the precision of the assessment. The better the match, the more efficient a test is for any required level of precision.

A test that provides precise results with a relatively small number of items is considered efficient. Factors that affect efficiency include the quality of the questions on a test and the targeting of the difficulty of the questions on the test to the achievement level of a particular student.

With the increasing demands on schools to bring students to higher standards, it is important to spend as much time teaching students as possible. At the same time, educators need more information about students to help them achieve more. Since additional assessment is time consuming, the need for student achievement information must be balanced with the need to minimize the time that students spend in assessment situations. In order to help educators meet these competing needs, assessment tools need to be as efficient as possible.


MAP and ALT in a Comprehensive Assessment Plan

One approach to gathering student performance information is to include paper-and-pencil tests such as ALT and computerized adaptive tests such as MAP in a comprehensive assessment plan. These assessments incorporate a strong underlying measurement model and a large bank of test questions to allow choices that fit the local curriculum. This allows the MAP and ALT tests to meet or exceed any practical standards of reliability and content validity. At the same time, MAP and ALT provide students with relatively short tests, preserving classroom time for instruction. Incorporated within a comprehensive assessment program, ALT and MAP can provide extremely useful information at the student, classroom, school, district, and state levels.

Testing for a Purpose: NWEA Assessment Design

The NWEA assessments were designed using a small number of guiding principles. Each principle relates to creating an assessment for a specific educational purpose. The principles are:

• A test should be challenging for a student. It should not be frustrating or boring.

• A test should be an efficient use of student time. It should provide as much information as possible for the time it takes to administer.

• A test should provide a reflection of a student's achievement that is as accurate as needed for the decisions to be made.

• A test should consist of content the student should have had an opportunity to learn.

• A test should provide information about a student's growth as well as the current achievement level.

• A test should provide results to educators and other stakeholders as quickly as possible while maintaining a high level of integrity in the reported results.

These principles are carried into the design of each NWEA assessment. This can be seen clearly in the choice of testing structure and testing modality.

Test Structure

In the past, educational tests included a single test form per grade level. This type of test is commonly called a wide-range test because the questions have a wide variety of difficulty levels. A problem with this type of test is that in trying to give a sampling of content that is appropriate for many students, the test as a whole is appropriate for very few students. Almost any student taking a wide-range test encounters some items that are too easy, some that are reasonably challenging, and some that are too difficult.

In any single grade level, the difference in achievement between low- and high-achieving students is quite large. In building a single test to assess all of the students in a grade, there must be some extremely hard questions and some extremely easy questions. In fact, the range of difficulty is commonly large enough to reduce the degree to which the test is targeted to any single student. As a result, high achievers find this kind of test to be a snap, while low achievers are frustrated by its difficulty.

To improve the efficiency of assessments, NWEA has developed two types of tests that target test difficulty to student achievement. These two types of tests are Achievement Level Tests (ALT) and Measures of Academic Progress (MAP).

Achievement Level Tests (ALT)

The NWEA ALT assessments are different from wide-range grade-level tests. Rather than constructing one test for every grade, ALT consists of a series of tests that are aligned with the difficulty of the content rather than the age or grade of the student. The items of a single ALT level have a small, targeted range of difficulty, designed to enhance the match of the test to the set of students who take it. The range of difficulty for a single ALT level is far smaller than the range of difficulty for a wide-range test. This allows a lot of information to be collected concerning a student's achievement, provided the correct level is administered.

Figure 1 compares the SEM from a wide-range test to that from an ALT series of eight levels. Since error and precision are inversely related, the test with the smaller SEM is the test with the greater precision. Figure 1 shows that the ALT series tends to result in less error than the wide-range test across the entire measurement range of the tests. Figure 1 also shows that the ALT series has a measurable score range that is twice as large as that of the wide-range test.

While the SEM of the test score is an important characteristic, the student's proportion of correct responses is also important in identifying the usefulness of a test score. The vertical lines in Figure 1 indicate the range in which the student's score falls between near-chance performance (25% to 30% correct, depending on the content area) and 92% correct. Beyond this range, the student will have very unstable scores, since the test is far too easy or far too difficult. Figure 1 shows that the ALT series has a much broader measurement range with reasonable precision than the wide-range test.


Figure 1: Measurement error of typical wide-range test vs. ALT series

[Figure 1 plots measurement error against RIT score (roughly 130 to 290) for ALT levels L-1 through L-8 and for a typical wide-range test (WR), and marks the measurable range of the ALT series and of the wide-range test under the same percent-correct rules.]

ALT assessments are designed to provide better targeting of test difficulty to student achievement. Figure 2 shows the percentage of students receiving tests of different relative difficulty. The example is based on 3,000 fourth-grade students taking a reading test in California. A test that has an average item difficulty equal to the student's achievement is perfectly targeted. A test that is too difficult or too easy will not provide as much information.


Figure 2 shows that ALT results in better targeting of the tests than using a single wide-range test. In this example, the average absolute difference between the item difficulty and the student achievement was 8.1 for ALT and 14.5 for the wide-range test. The greatest targeting error seen for ALT was 23.4 RIT points (see the section The Measurement Model for information about the RIT scale), while for the wide-range test it was 57.0 RIT points. Figures 1 and 2 clearly show that the ALT system results in tests that are closer in difficulty to the achievement of the students and yields more accurate scores over a wider range of achievement.

Figure 2: Comparison of item difficulty targeting in ALT and wide-range tests

[Figure 2 plots the proportion of students against the difference between student score and item difficulty (-60 to +60), comparing the wide-range test with ALT.]

Measures of Academic Progress (MAP)

The other testing structure used in NWEA assessments is computerized adaptive testing through MAP. The basic concept behind adaptive testing is relatively simple. Given a pool of items with calibrated item difficulties, a student is presented with an item of reasonable difficulty based on what is known about the student's achievement. After the student answers the item, his or her achievement is estimated. The next item chosen for the student to see is one that is matched to this new achievement level estimate. If the student missed the previous item, an easier item is presented. If the student answered the previous item correctly, a more difficult item is presented.

This process of item selection based on all the responses that have been made by the student repeats itself again and again. With each item presented, the precision of the student's achievement level estimate increases. This results in smaller and smaller changes in item difficulty as the test pinpoints the student's actual achievement level. This process continues until the test is complete.
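The selection loop can be sketched in a short Python program. This is only an illustration of the general adaptive-testing idea under the Rasch model; the item pool, the nearest-difficulty selection rule, and the simple maximum-likelihood update below are hypothetical stand-ins, not NWEA's operational algorithm.

    import math
    import random

    def p_correct(theta, b):
        """Rasch (1PL) probability that a student at theta answers an item of difficulty b correctly."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def estimate_theta(responses, theta=0.0):
        """Maximum-likelihood achievement estimate from (difficulty, correct) pairs.
        A few Newton-Raphson steps, clamped so all-correct or all-incorrect patterns stay finite."""
        for _ in range(10):
            probs = [p_correct(theta, b) for b, _ in responses]
            gradient = sum(u - p for (_, u), p in zip(responses, probs))
            information = sum(p * (1.0 - p) for p in probs)
            theta += gradient / max(information, 0.1)
            theta = max(min(theta, 4.0), -4.0)
        return theta

    def adaptive_test(true_theta, item_bank, test_length=20):
        """Give test_length items, each time choosing the unused item closest to the current estimate."""
        responses, theta_hat = [], 0.0
        remaining = list(item_bank)
        for _ in range(test_length):
            b = min(remaining, key=lambda d: abs(d - theta_hat))    # best-targeted remaining item
            remaining.remove(b)
            correct = random.random() < p_correct(true_theta, b)    # simulate the student's answer
            responses.append((b, int(correct)))
            theta_hat = estimate_theta(responses, theta_hat)        # re-estimate after every item
        info = sum(p_correct(theta_hat, b) * (1 - p_correct(theta_hat, b)) for b, _ in responses)
        return theta_hat, 1.0 / math.sqrt(info)                     # final estimate and its SEM

    bank = [i / 10.0 for i in range(-30, 31)]                       # hypothetical difficulties, -3 to 3 logits
    theta_hat, sem = adaptive_test(true_theta=1.2, item_bank=bank)
    print(f"estimate: {theta_hat:.2f} logits (RIT {theta_hat * 10 + 200:.0f}), SEM: {sem:.2f}")

Running the sketch shows the behavior described above: early items swing the estimate widely, later items are closely targeted, and the SEM shrinks as information accumulates.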

Figure 3 shows the student score and the error band around the score as each item is presented in a MAP mathematics test. The characteristics of an adaptive test are very clear in this figure. Early in the test, the student's score varies substantially from item to item and has a very wide error band. As the test continues, the student's achievement level estimate stabilizes, and the associated error band shrinks substantially.

Figure 3: Student achievement level estimates and error bands following each item administered in a MAP mathematics test

[Figure 3 plots the student's RIT score after each of the 20 test questions administered, with the basic, proficient, and advanced bands marked; the estimates swing between about 206 and 233 over the first few items and settle near 207 by the end of the test.]

In the test shown in Figure 3, three achievement categories have been identified (basic, proficient, and advanced). Figure 3 illustrates that at the beginning of the test the error band around the score overlaps all three categories, indicating that no confident decision can be made about the student's actual category. By the end of the test, the error band has decreased to the point that it is completely contained within the proficient category. At this point a confident decision can be made concerning the student's actual achievement category.
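The decision rule amounts to checking whether the score's error band falls entirely within one category. A minimal sketch, with made-up category names and cut scores and a one-SEM band chosen purely for illustration:

    def category_decision(rit_score, sem, cuts):
        """Return a category name if the band rit_score +/- sem sits entirely inside one category, else None.
        `cuts` maps category names to (low, high) RIT bounds; the values used below are invented."""
        low, high = rit_score - sem, rit_score + sem
        for name, (lo, hi) in cuts.items():
            if lo <= low and high <= hi:
                return name
        return None  # the band overlaps a cut score, so no confident decision yet

    cuts = {"basic": (150, 205), "proficient": (205, 225), "advanced": (225, 300)}
    print(category_decision(219, 8.0, cuts))   # None: the band 211-227 crosses the 225 cut
    print(category_decision(207, 1.8, cuts))   # 'proficient': the band 205.2-208.8 fits inside 205-225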

One advantage to administering MAP is that every student receives a unique test, custom tailored to the performance of the student throughout the assessment. This means that the targeting of item difficulty is typically even better than that of a particular ALT test. This also means that the precision of MAP is greater than that of ALT, although the actual difference is quite small.


Figure 4 displays the average SEM for scores from MAP and ALT for a set of language usage assessments. Notice that for the most part, MAP and ALT have nearly identical SEM values. This is in keeping with the underlying test design and measurement theory. The development and implementation of computerized adaptive testing in education has been researched over the last 30 years (Weiss and Kingsbury, 1984) and has evolved to a point where it is now an extremely useful testing approach for many organizations that have large item banks and a solid computer infrastructure for test administration.

Figure 4: Comparison of SEMs for MAP and ALT

[Figure 4 plots MAP and ALT measurement error for language usage by RIT score (Spring 2001), with error from 0 to 8 on the vertical axis and the RIT scale from 130 to 270 on the horizontal axis; ALT n = 294,672, MAP n = 75,226.]


Figure 5 shows the targeting of the MAP tests (the difference between average item difficulty and student achievement) compared to a typical wide-range test. You can see that the MAP tests are substantially better targeted to student achievement than the wide-range test. The average absolute difference between the student achievement level and the average difficulty of the items on the test was 2.0 RIT points, and the largest difference observed was 22 RIT points. Comparing Figure 5 to Figure 2 shows that the MAP system results in better targeting than the ALT system.

Figure 5: Comparison of item difficulty targeting in MAP and wide-range tests

[Figure 5 plots the proportion of students against the difference between student score and item difficulty (-60 to +60), comparing the wide-range test with MAP.]

These figures show clearly that ALT and MAP provide almost all students with tests that are better targeted to their skill levels. As a result, the test scores from MAP and ALT are more precise than those from a comparable wide-range test. This is in keeping with several of the principles described at the beginning of this section.

Testing Modality

In order to make this system even more efficient and informative, NWEA uses two different test modalities: paper and pencil delivery and computer delivery. An educational agency must decide which testing modality is appropriate for them.

ALT tests are delivered in paper form. MAP tests are delivered via computer. Since the process for selecting items in an adaptive test requires rapid estimation of the student's achievement and rapid calculation to select the item to be administered to the student, a computer is required for MAP administration.


There are several factors an educational agency should consider in deciding which testing modality is right for them:

• Availability of resources—Since MAP requires computers for test administration, organizations must have workstations available for use. ALT does not require each student to be tested by computer, but requires a different set of resources, including logistic support for printing and distributing test materials, control systems for processing answer sheets, and disposal or storage procedures for test materials following testing.

• Time frame available for test administration—MAP can be simultaneously administered to only as many students as there are computers available. There are rarely enough computers at a particular site to test all students simultaneously. Therefore, agencies must schedule test administration sessions with individual classes or students. ALT can be administered simultaneously to many students. Since the MAP test is created dynamically from a large item pool, however, the need for simultaneous testing is not as great as with ALT.

• Time frame available for scoring—MAP scoring and results validation happen in less than 24 hours. ALT requires machine scoring that is done by either the organization or NWEA. The ALT scoring process normally requires approximately two weeks.

• Need to retest students with invalid tests—Sometimes (less than 6 percent of the time) a student achieves a score outside the valid measurement range for the ALT level administered. In this situation, the student is retested with a level test more appropriate to the student's achievement or with the MAP system. Using the MAP system, retests are rarely needed and usually occur when hardware fails or the student becomes ill during testing.

• Desire to control item selection—ALT allows an organization to hand-pick the items for their ALT series. Since MAP builds a custom test for each student during test administration, organizations cannot control the exact content administered to each student. Nonetheless, organizations choosing MAP do have a great deal of control over the types of items presented to students. This is due to the robust blueprinting and item selection process available within MAP.

By considering all of these factors, educational agencies can choose the testing modality that is most useful to them.


The Item Banks and Measurement Scales

To develop assessments that have a high degree of validity and reliability for a variety of different educational agencies, it is important to have large banks of high-quality test items. These items need to be connected to an underlying measurement scale using procedures that ensure consistency of scores from one set of items to another. This section outlines the process that NWEA follows to create items, incorporate them into the NWEA measurement scales, and maintain the scales and items over time. These processes and the theory behind them allow the development of valid, reliable, and precise assessments from the NWEA item banks. An understanding of these processes helps in interpreting and explaining the results of the assessments.

The Measurement Model

Before discussing how items are added to the NWEA measurement scales, it is important to understand the measurement model that is used to create the scales. Item Response Theory (IRT) (Lord & Novick, 1968; Lord, 1980; Rasch, 1980) defines models that guide both the theoretical and practical aspects of NWEA scale development.

For the purpose of understanding these measurement models, consider again the analogy of the ruler used in the Introduction. The NWEA measurement scales are very much like a ruler. The NWEA reading scale is a ruler of reading. With an actual ruler, the tick marks are placed evenly and accurately on the ruler by using a standardized device such as an inch-long piece of wood to draw each tick mark. Locating the tick marks for an educational scale such as the reading scale is not this easy, but it is quite possible using IRT.

One important element of IRT is the mathematical model that grounds it. NWEA uses the one-parameter logistic model (1PL), which is also known as the Rasch model. The model is as follows:

P_i(θ_j) = e^(θ_j - b_i) / (1 + e^(θ_j - b_i))

This is a probabilistic model that estimates the probability (P) that a student (j) will answer a question (i) correctly, given the difficulty of the question (b_i) and the achievement level of the student (θ_j). The constant (e) is equal to the base of the natural log (approximately 2.718).


A benefit of the use of an IRT model is that the student achievement levels and item difficulties are aligned on a single, interval scale. This measurement scale is "equal interval" in nature, meaning that the distance between tick marks is equal, like a ruler. The units (equivalent to the ruler's tick marks) are called log-odds units, or logits. Logit values are generally centered at zero and can theoretically vary from negative infinity to positive infinity. To simplify interpretation and eliminate negative scores, the RIT (Rasch unIT) scale was developed. Since the RIT score is a linear transformation of the logit scale, it is still an IRT scale and maintains all of the characteristics of the original scale. In order to calculate a RIT score from a logit score, a simple transformation is employed:

RIT score = (logit score × 10) + 200

Therefore, if a logit value is zero, then the RIT value equals 200. This scale has positive scores for all practical measurement applications and is not easily mistaken for other common educational measurement scales. The RIT scale is used throughout ALT and MAP.
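For readers who prefer to see the two formulas side by side, here is a minimal Python rendering of the Rasch probability and the logit-to-RIT transformation (illustrative only; the function names are ours, not NWEA's):

    import math

    def rasch_probability(theta_logits, b_logits):
        """P that a student at achievement theta answers an item of difficulty b correctly (1PL/Rasch)."""
        return math.exp(theta_logits - b_logits) / (1.0 + math.exp(theta_logits - b_logits))

    def logits_to_rit(logits):
        """The linear transformation from the logit scale to the RIT scale."""
        return logits * 10 + 200

    def rit_to_logits(rit):
        """The inverse transformation, from RIT back to logits."""
        return (rit - 200) / 10

    print(logits_to_rit(0.0))                       # 200.0, as stated in the text
    print(round(rasch_probability(1.0, 0.0), 2))    # 0.73 for a student one logit (10 RIT) above the item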

IRT measurement scales have a number of useful characteristics when they are applied and maintained properly. The characteristics that have been most important to the development of the NWEA measurement scales and item banks are:

• Item difficulty calibration is sample free. This means that if different sets of students who have had an opportunity to learn the material answer the same set of questions, the resulting difficulty estimates for the items are equivalent aside from sampling fluctuation.

• Achievement level estimation is sample free. This means that if different sets of questions are given to a student who has had an opportunity to learn the material, the student's score is equivalent aside from sampling fluctuation.

• The item difficulty values define the test characteristics. This means that once the item difficulty estimates for the items to be used in a test are known, the precision and the measurement range of the test are determined.

These simple properties of IRT result in a variety of test development and delivery applications. Since IRT enables one to administer different items to different students and obtain comparable results, the development of targeted tests becomes practical. This is the cornerstone for the development of level testing and adaptive testing systems.
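As a rough illustration of the third property listed above (that item difficulties determine a form's precision and measurement range), the sketch below computes the expected percent-correct curve for a hypothetical fixed form and derives a measurable range from it. The 30% and 92% bounds echo the percent-correct rules mentioned for Figure 1, and the item difficulties are invented; none of this reflects an NWEA specification.

    import math

    def p_correct(theta, b):
        """Rasch probability that a student at theta answers an item of difficulty b correctly."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def expected_percent_correct(theta, difficulties):
        """Expected percent correct on a fixed form for a student at theta (values in logits)."""
        return 100.0 * sum(p_correct(theta, b) for b in difficulties) / len(difficulties)

    def measurable_range(difficulties, low=30.0, high=92.0):
        """RIT range over which the form's expected percent correct stays between low and high.
        The 30/92 bounds are assumptions borrowed from the Figure 1 discussion, for illustration."""
        rits = [r for r in range(130, 291)
                if low <= expected_percent_correct((r - 200) / 10.0, difficulties) <= high]
        return (min(rits), max(rits)) if rits else None

    # A hypothetical narrow band of item difficulties centered near RIT 210 (1.0 logit).
    level_items = [0.6 + 0.05 * k for k in range(17)]
    print(measurable_range(level_items))   # roughly (202, 234) for this made-up form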

These IRT properties also enable the building of item banks with items that extend beyond a single grade level or school district. This has enabled NWEA to develop measurement scales and item banks that extend from elementary school to high school. By combining these properties of IRT with appropriate scale development procedures, NWEA has also developed scales and item banks that endure across time and generalize to be useful in a wide variety of educational settings.


Scale Development and Item Calibration

The development of the measurement scales and item banks has two primary phases. In the initial phase, the boundaries for the scale are established, initial field testing of items is performed, and scale values are established. The second phase includes maintenance of the scale, extension of content, and addition of new items.

Phase I: Developing the Measurement Scales

The development of each NWEA measurement scale follows the same set of steps:

• Identify the content boundaries for the measurement scale. During this step, a variety of content structures from different agencies for the domain of interest are reviewed and joined. Together they create a content index that is as detailed as, but broader in scope than, any single description of the content area.

• Develop items that sample the content with a wide range of difficulty. Groups of classroom teachers are trained to write high-quality test questions and participate in a multi-day workshop to produce test items related to each element of the content domain.

• Develop the networked sampling design to be used in field testing. To assure robustness of difficulty estimation, each item is included in several test forms to allow the item to be seen in a variety of contexts and positions. The network design used in most NWEA scale development is the four-square design (Wright, 1977).

• Identify samples of students appropriate for the items to be tested. Each test form is scheduled to be administered to 300-400 students in different classrooms, schools, and grades.

• Administer the field test. Students take the field tests in settings that mimic the actual test as closely as possible. Tests are teacher-proctored and presented without fixed time limits. Responses are entered on the same type of answer sheet used for the operational tests.

• Test for dimensionality. Once field tests are administered, samples of the test forms are used to investigate whether responses are affected by more than one primary dimension of achievement. These analyses include factor analytic procedures and content area calibration procedures (Bejar, 1980).

• Estimate item difficulties. Once field test information has been collected, a conditional maximum-likelihood procedure is used to calculate item difficulty estimates for the items in each test form (Baker, 2001; Warm, 1989). This results in a set of estimates and fit statistics for each form in which a particular item appears.

• Test items for model fit. Fit statistics (point biserial, revised mean square fit) are calculated for each item on each test form. In addition, the percentage of students answering each item correctly and the percentage of students omitting each item are calculated. Each form is then reviewed, and items are eliminated from further consideration if they have poor fit statistics or if they are answered correctly by too high or too low a percentage of the students in the sample.

• Triangulate item difficulties. The remaining items that appear on multiple forms are used in a process of triangulation to identify linking values that result in the most plausible difficulty values for each item across all of the forms (Ingebo, 1997). This process results in a single difficulty estimate for any particular item that is the best representation of the difficulty of the item across all of the forms. At the completion of this process, the difficulty estimates for all of the items administered are identified along a single, underlying difficulty continuum.

• Apply the logit-to-RIT transformation. In the development of any IRT scale, a single linear transformation is allowed. This gives the scale the numerical characteristics desired by the developers. Once the items have been identified with triangulated difficulty estimates, the linear transformation described earlier is used to transfer the item difficulty values from the logit scale to the RIT scale.

Once these steps have been carried out, the initial scale development is complete. NWEA has used this same process to create its scales in reading, mathematics, language usage, general science, and science concepts and processes. More details about the original development of the NWEA scales can be found in Ingebo (1997).

Phase II: Maintaining the Measurement Scales

Once the initial phase of scale development has been completed, the scale can be used to design tests, score student assessments, and measure student growth. At this point, the highest priority shifts to maintaining the measurement scale. A primary element in this maintenance is establishing scale stability. Items are always added to the item banks using processes that assure that the original scale remains intact.

The process of calibrating the difficulty of a particular item to an existing scale can be compared to calibrating a new bathroom scale using objects of known weight. In the same manner, adding new items to the original measurement scale requires the use of items with known difficulty estimates. New items must be added to the scale in a way that does not disrupt the original scale. The field testing process is designed to collect data to permit this seamless addition of items to the original scale.

Newly developed items are field tested by placing them into a test together with active items. Students' scores are then estimated using only the active items to anchor them to the original measurement scale. Using the fixed student achievement estimates, the difficulty of each of the developmental items is estimated using a one-step estimation procedure. This procedure is contained within a proprietary item calibration program designed for the purpose. Since these item difficulty estimates are based on student achievement level estimates from items on the original scale, they are also on the original scale. This procedure allows virtually unlimited expansion of the item banks within the range of difficulty represented by the original items. It also allows for the careful extension of the original measurement scale to easier and more difficult test questions.
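The anchoring idea can be illustrated with a short sketch of fixed-ability (anchored) calibration in general. This is not NWEA's proprietary calibration program; the data are invented, and the estimation is shown here as a simple Newton-Raphson maximum-likelihood estimate of one new item's difficulty with the student achievement estimates held fixed.

    import math

    def rasch_p(theta, b):
        """Rasch (1PL) probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def calibrate_item(anchored_thetas, responses, b=0.0, iterations=20):
        """Estimate one new item's difficulty with student achievement estimates held fixed (anchored).
        anchored_thetas: achievement estimates obtained from the active items (logits).
        responses: 1 for a correct answer to the new item, 0 for an incorrect answer."""
        for _ in range(iterations):
            probs = [rasch_p(t, b) for t in anchored_thetas]
            gradient = sum(p - u for p, u in zip(probs, responses))   # d(log-likelihood)/db
            information = sum(p * (1.0 - p) for p in probs)           # negative second derivative
            b += gradient / information                               # Newton-Raphson step toward the MLE
        return b

    # Invented field-test data: fixed abilities from the active items and answers to the new item.
    thetas = [-1.5, -0.8, -0.2, 0.0, 0.3, 0.7, 1.1, 1.6, 2.0, 2.4]
    answers = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
    b_hat = calibrate_item(thetas, answers)
    print(f"estimated difficulty: {b_hat:.2f} logits (RIT {b_hat * 10 + 200:.0f})")

Because the ability values are anchored to the existing scale, the resulting difficulty estimate is expressed on that same scale, which is the point the paragraph above makes.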


By using IRT to create a scale and by anchoring the difficulty estimates of the items to this scale, NWEA assures that the scale is constant from one set of items to another, and from one set of examinees to another. This means that two students can receive different sets of items and the achievement estimates will connect to the common measurement scale. As a result, educational decisions based on the results are comparable. Both MAP and ALT take advantage of this measurement characteristic of the scales to allow different students to take different sets of test questions. This also allows longitudinal assessments of student performance to be made without the types of discontinuities that have plagued many test publishers when they change their norms or re-center their scale scores.

Item Development

The item development process used by NWEA has been refined over the past 25 years as the item banks in reading, mathematics, language usage, general science, and science concepts and processes have expanded. The methods that NWEA uses to incorporate items into the NWEA scales and maintain the scales are in widespread use throughout the measurement community, although particular aspects of the process are unique to the NWEA item banks.

There are four basic phases of item development:

• Item writing and editing

• Item bias review

• Item index assignment

• Field testing

After field testing, new items are analyzed for quality before they are made available for use in tests. If items have the statistical characteristics required by this quality analysis, they are made active and may be used within MAP and ALT assessments. The item banks are also monitored to ensure that the items and scale perform in a stable manner from setting to setting and from year to year.

This section explains each of the stages of item development, the methods used to incorporate the test items into the item banks, and the process used to maintain the quality of the measurement scales and items over time.

Item Writing and Editing

The first phase of item development is simply writing the items. In order to write items for use in the item banks, item writing workshops are held with groups of classroom teachers. These teachers may come from a number of different educational agencies and attend the workshop for the explicit purpose of writing test questions. By using active teachers to write items, the items can more accurately reflect the curriculum actually being taught in the classroom.


At each workshop, the teachers are first taught the basic guidelines for constructing multiple-choice items (Haladyna, 1994; Osterlind, 1998; Roid & Haladyna, 1997). Writers are instructed on item writing terminology and the need for clarity in the items. They are taught how to make the stem of an item unambiguous and concise and are directed on how to write parallel distracters. Writers are instructed to use only four response options for reading and language usage and five options for mathematics and science. The teachers are encouraged to write the stem of an item using positive wording and to use completely independent distracters.

Following training, writers are assigned some general topic areas such as algebra or computation. They then write items in these areas that would be appropriate for the students in their classes. Teachers write approximately ten to fifteen items each day. As the items are written, they are exchanged with other teachers who check the items for technical accuracy, grammatical accuracy, completeness, readability, potential for bias, and clarity. Once the items have been reviewed, the item writer specifies the grade range for which the item is appropriate and the grade range in which the item should be field tested.

After the item writing session, the items are reviewed and edited by NWEA staff. This review includes editing the grammar, format, and style of the items. It also includes reviewing and editing the presentation of the items to allow seamless computer administration. Modifications that may impact the correctness of the item are also reviewed before the item is moved on to the next stage of development, bias review.

Item Bias Review

The purpose of the bias review is to ensure that all of the items contained in the NWEA item banks are written in such a way that students of all ethnicities, races, and genders will interpret them in a similar manner. To ensure that items are free of race, ethnicity, and gender bias, two different steps are taken. During the item development process, other teachers review the items for potential bias. The teachers look at each item to identify whether it contains words or passages that do not pertain to the skill being tested but might be misleading or misunderstood by students with particular racial or ethnic backgrounds.

In addition, NWEA holds item bias review panels. The focus of these panels is to see that all students have a fair opportunity to answer the item correctly, without being distracted or misled by the context of the item. During these events, many of the items in the item banks are reviewed by a panel of stakeholders from a variety of racial and ethnic backgrounds. This panel examines each item using the same guidelines used by the teachers who originally wrote them. Items that have potential bias are edited by the review panel or sent back to the original item author to be revised. In addition, the panel may suggest wording changes—such as the names of individuals contained within the item text—that might help the set of reviewed items reflect greater diversity. Occasionally items are rejected and removed from active use in the item banks.


Content Category Assignment

NWEA also assigns every item to a content category. By doing so, the performance of students in specific content areas can be tracked. This enables NWEA to relate student performance to the NWEA Learning Continuum and to provide agencies with reports that contain student performance within the various content categories.

The schemas for the content categories within each bank were developed by NWEA staff and content specialists. In building each content index, the content of the items within the bank was analyzed, and a logical, hierarchical structure for the content of each subject was developed. This structure was based on an amalgamation of the curriculum structures used in a variety of NWEA districts. From time to time, the content structures are updated as more precise ways of differentiating the content of the items within a bank are discovered. They are also updated whenever a new content area is introduced into the bank. Currently, there are approximately 150 to 200 content categories in each subject area. Each content category serves as the reference point for up to 200 items.

The process by which an individual item is assigned to a content category is simple. Generally, the item authors are provided with a copy of the content structure and are instructed on how to interpret it. During item writing, the item authors identify and document which index category they believe to be most appropriate for the item. After the item writing session, these assignments are reviewed by NWEA staff and any questionable assignments are discussed and altered if needed.

Field Testing

Once the items have been written, edited, reviewed for possible bias, and assigned to an item index category, they are field tested. The purpose of field testing the items is to collect data that will be used to analyze the quality of the items and incorporate them into the measurement scales. Since the collection of field-test data uses student and teacher time, every effort is made to make field testing as unobtrusive as possible.

There are three different ways that items are field tested. In the first method, students are given a separate mini-test following the completion of their actual assessment. This mini-test, called a field-test form, contains between ten and twenty field-test items.

The second field-test method involves administering assessments that contain several field-test items placed within the assessment so that they are transparent to the student. When the assessment is scored, only the data from the active items are used, so a student’s score is not influenced by the presence of field-test items. By constructing and administering the field tests in this manner, student time is minimally impacted and student scores are unaffected.

NWEA has also field tested items by administering a special test consisting of between thirty and forty items, most of which are field-test items. Students do not receive scores when they take these assessments. Field testing in this manner allows a great number of items to be evaluated in a short period of time but uses student time without providing scores that are useful in the classroom. For this reason, this method is used only when necessary.


To ensure that the quality of the data is high, field-test items are administered only in the grade or grades suggested by the author. This ensures that the sample of students taking any field-test item is reflective of the sample of students who will be taking the item after it becomes active.

The size of the student sample also affects the quality of the data. Each item is administered to a sample of at least 300 students. Ingebo (1997) has shown that this sample size is adequate for very accurate item calibrations and item fit statistics.

Another essential aspect of quality data collection is student motivation. By embedding the field-test items in a live test that is scored and reported, they appear identical to active items. As a result, students are equally motivated to answer field-test and active items.

Finally, the environment for data collection should be free from the influence of other confounding variables such as cheating or fatigue. Since the field-test data are collected within the normal NWEA test administration process, which is designed to equalize or minimize the impact of outside influences, the environment is optimal for data collection.

The field-test processes provide excellent field-test data that in turn allow the addition of high-quality items to the item banks. The items are administered to a sizable sample of students, and the data students provide are collected in a manner that motivates the students to work seriously in an environment free from external influences on the data.

Item Analyses and Calibration

After items have been written, edited, and field tested, the next step in the item development process is to analyze how the items perform. Two statistical indices are used to help identify unusual items.

The first index is the adjusted point-biserial correlation. This index quantifies the relationship between the performance of students on a specific item and the performance of students on the test as a whole.

The overall test scores for individuals who answer a particular item correctly are expected to be higher than the test scores of those who do not answer the item correctly. If this is the case, then the point-biserial correlation is positive and can approach a theoretical maximum of 1.0. If more lower-performing students answered the item correctly, then the point-biserial is low or even negative. A low point-biserial value for an item usually indicates that the item is not working with the scale. Items with negative point-biserial values may in fact be working against the scale.
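
A common form of this adjustment is to correlate the 0/1 item score with the rest-of-test score (the total with the item itself removed), so the item cannot inflate its own correlation. The exact adjustment NWEA applies is not specified here; the sketch below shows the corrected point-biserial in that common form.

def adjusted_point_biserial(item_scores, total_scores):
    # item_scores: 0/1 scores on one item; total_scores: total test scores.
    # Assumes both the item scores and the rest-of-test scores show variation.
    rest = [t - i for i, t in zip(item_scores, total_scores)]
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_r = sum(rest) / n
    cov = sum((i - mean_i) * (r - mean_r) for i, r in zip(item_scores, rest)) / n
    sd_i = (sum((i - mean_i) ** 2 for i in item_scores) / n) ** 0.5
    sd_r = (sum((r - mean_r) ** 2 for r in rest) / n) ** 0.5
    return cov / (sd_i * sd_r)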

There are many reasons why a seemingly reasonable item might not be working well. One example might be a mathematics question with very difficult vocabulary or sentence structure. Another might be a reading question that could be answered without reading the passage for the item.

The second index is the adjusted Root-Mean-Square Fit (RMSF). This index is a measure of how well the item fits the scale. The RMSF is calculated by comparing actual student performance to the theoretical values that are expected from the model. A high value for the RMSF indicates that the item does not fit the model well. Similar to the point-biserial, the RMSF indicates whether high-performing students are missing an item more often than expected or low-performing students are answering an item correctly more often than expected. This index can reveal more subtle difficulties with an item than the point-biserial.

For instance, an item with two correct answers may have an adequate point-biserial, but it may have a high RMSF value. Items with a high RMSF value are reviewed graphically using the item response function for the item. The item response function is a plot that shows the probability of correct response to an item against the achievement level of the student. When reviewing an item, the empirical item response function is plotted on the same scale as the theoretical function. When there are large discrepancies between the two curves, there is a lack of fit between the item and the scale. By reviewing the response functions, a more comprehensive understanding of item performance can be gained.
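
The exact formula for NWEA’s adjusted RMSF is not given in this section. A common way to build a root-mean-square fit statistic of this kind is to compare each observed response with the probability the model expects and average the squared standardized residuals; the sketch below follows that assumption, with values well above 1.0 suggesting misfit.

import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def root_mean_square_fit(thetas, responses, item_difficulty):
    # Compare each student's observed 0/1 response to the model's expected
    # probability and accumulate squared standardized residuals.
    squared_residuals = []
    for theta, x in zip(thetas, responses):
        p = rasch_p(theta, item_difficulty)
        squared_residuals.append((x - p) ** 2 / (p * (1.0 - p)))
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))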

Figures 6 and 7 show theoretical and empirically observed response functions for two items (each item was a difficult mathematics item, field tested with approximately 400 students). Figure 6 shows the results for an item that has poor fit to the measurement model (indicated by an RMSF above 2.0). Upon review, the item was identified as being vaguely worded. This item was rejected for use in the item banks. Figure 7 shows the results from an item with good fit to the measurement model. This item was approved for use in the item banks.


Figure 6: Item response function for a poorly-performing item (RIT = 244, RMSF = 3.36)

Figure 7: Item response function for a well-performing item (RIT = 244, RMSF = 0.80)

Upon the completion of item analyses, items that perform well are added to the item banks for use in future test development. Those that do not perform well are flagged for review.


During the review, minor content problems with the item are often uncovered and corrected. From there, the item is field tested once again. If it performs poorly again and subsequent review reveals no obvious problem with the item, it is either eliminated from the item bank or retested in a different grade. These quality assurance procedures allow the item banks to grow while filtering out those items with performance difficulties.

Periodic Review of Item Performance

After the items have been developed and incorporated into the NWEA scales, NWEA monitors the items, the banks, and the scales to maintain their quality. Although the measurement scale underlying the items is theoretically invariant, sometimes the meaning of an item can change as society evolves or as curriculum changes. For instance, a word may come into common usage for a short time, either with or without its original definition. (The word “radical” is a classic example from the 1980s.) This type of change may alter the difficulty of an item from its initial difficulty estimate for a period of time. (Similar difficulty drift may arise with items that test monetary skills with obsolete prices.)

This is also the case when the curriculum for a subject changes dramatically on a national scale, such as the increased emphasis on integrated mathematics over the last twenty years. In an effort to keep the scale up to date, NWEA conducts periodic reviews of the content within the item banks. Generally, these reviews are targeted at a specific issue, such as a set of words that may have come into common usage. When items with such content difficulties are discovered, they are prohibited from use, although they may be revised and re-field tested if the content is otherwise worthwhile.

In addition to inspecting performance of specific items, additional studies are also performed periodically to determine whether the scale itself is fluctuating or drifting across time. This is done by recalibrating the items after several years have passed since the initial calibration. A recent study (Kingsbury, 2003) investigated possible drift over a 22-year time period, using over 1,000 reading items and 2,000 mathematics items administered to over 100,000 students taking ALT and MAP tests. This study replicated two primary findings seen in earlier studies:

• The difficulty values of items have not changed across time more than would be expected from normal sampling variation.

• The measurement scales have not drifted by more than 0.01 standard deviations over the quarter of a century in which they have been used.

Periodic reviews of item performance and scale stability ensure that the item calibrations are appropriate and the scales are stable. This is one aspect of the information that is needed to ensure that educational agencies can construct valid, reliable, and precise assessments using the item banks.


The Assessment Process

Note: This section describes the rationale behind assessment administration, scoring, reporting, and results interpretation, as well as the technical specifications pertinent to each phase. There are several other documents that supplement this section. Consult the MAP administration training materials and the ALT Administration Guide to understand how to administer the assessments. For information on how to score and generate reports for ALT, consult the User’s Manual for the NWEA Scoring and Reporting System.

Once an educational agency decides to incorporate NWEA assessments into its assessment strategy, it begins a process of creating an assessment system that meets its needs and is as efficient as possible. One of the first decisions is to determine which testing mode to use. Next, the tests are developed and then administered. Following administration, the tests are scored and checked to verify the integrity of the test data. Next, student, class, and school reports are generated for use by parents, students, teachers, and administrators. Lastly, the results of the assessments are interpreted and used to inform educational decisions. Since the quality of each stage in the process is dependent on the quality of all of the previous stages, the stages can be thought of as a pyramid, as shown in Figure 8.

Figure 8: The NWEA assessment stages


Test Development

Once a testing mode is selected, the first stage in the NWEA assessment process is test development. The NWEA test development process was established years ago by following methods pioneered by experts in the fields of psychometrics and educational assessment. Since then, the process has been refined to maximize the quality and efficiency of NWEA assessments and to include the delivery of the assessments via computer. The process is guided by the standards for educational and psychological testing as established by professional organizations in the field (AERA/APA/NCME, 1999).

The test development process consists of a set of decisions that must be made and activities that must be performed to create a functional test, as shown in Figure 9. The first step is to design the test. The next is to determine the aspects of the curriculum that will be tapped by the assessment. After that, the test items are selected from the NWEA item bank. Lastly, the items are assembled into tests and proofread for a variety of characteristics. At that point, the test development process is complete and the tests are ready to be administered. Details on each phase of the test development process are provided next.

Figure 9: The phases of the test development process (Test Design, Content Definition, Item Selection, Test Construction, from start to finish)

Test Design

The first phase in the test development process is test design. In this phase, the psychometric test specifications—such as the test length and item difficulty range—are established. NWEA has developed sets of optimal test specifications to use when designing and constructing these assessments. These specifications are used by all of the agencies administering MAP and ALT assessments. They have been designed with a range and flexibility applicable to a wide variety of educational situations to provide consistent, precise measurement.

ALT Test Specifications

For ALT, the test specifications developed during test design are:

• Difficulty range of each level test.

• Difficulty overlap of adjacent level tests.

• Number of items on each level test (also known as test length).


These specifications adjust to fit the nature of the test content and the use of the scores. By defining these specifications carefully, NWEA can create a level test series with desirable measurement characteristics. These specifications allow NWEA to predict with some accuracy the expected performance of the ALT system in terms of measurement precision even before the test is administered to a single student.

Difficulty range—Each ALT series contains five to eight individual test forms composed of items that increase in difficulty from one form (or level) to the next. This allows an individual level to match the performance of a particular student while the series spans the complete range of student achievement. The distribution of item difficulty is typically uniform for any single ALT test, and the range of difficulty is the same for every level test in the ALT series. When designing the ALT series, the size of the range of difficulty can be varied. When it is varied, it is always varied for each of the level tests of the entire ALT series.

As the size of the range of difficulty increases, the test’s capacity to assess a broader range of achievement levels increases. When this is done, the targeting of the test to the students is less precise, which reduces the precision of each student’s score. In addition, increasing the range of item difficulty in a particular level increases the chance that a student will see items that are too difficult or too easy.

The standard range of difficulty for an NWEA ALT test is 20 RIT points. The easiest and most difficult levels may extend beyond this range as needed to assess students accurately at the extremes. These levels tend to have a range of about 40 RIT points.

Difficulty overlap—Between adjacent level tests, there is always an overlap in item difficulty. In other words, the hardest items of the easier level test are always more difficult than the easiest items of the harder level test. The difficulty overlap used in the ALT tests is half of the difficulty range, or 10 RIT points. This design allows the entire range of common student performance to be assessed with a series of seven to nine levels, as shown in Figure 10.

Figure 10: ALT test structure (test widths and overall structure of a typical ALT series: levels L-1 through L-8, spanning approximately 150 to 240 on the RIT scale)
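
The width-and-overlap design can be written down directly: each level spans 20 RIT points and begins 10 points above the level before it. The sketch below generates the kind of structure shown in Figure 10; the 150 RIT starting point simply follows the typical series pictured there.

def alt_level_ranges(num_levels=8, start=150, width=20, overlap=10):
    # Return (low, high) RIT bounds for each level test in a series with a
    # fixed difficulty width and a fixed overlap between adjacent levels.
    step = width - overlap  # 10 RIT points between the starts of adjacent levels
    return [(start + i * step, start + i * step + width) for i in range(num_levels)]

for level, (low, high) in enumerate(alt_level_ranges(), start=1):
    print(f"L-{level}: {low}-{high} RIT")  # L-1: 150-170, L-2: 160-180, ..., L-8: 220-240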

Test length—Since test precision and test length are related, the number of items on a test has a significant impact on score precision. More items generally result in more precision in the test scores. This is also true of the precision of goal scores. Although the final specification determined during ALT test design is the number of items that the test contains, often NWEA starts with a desired precision and works back toward the number of items needed to achieve that precision.

Most level tests are designed to have a SEM of approximately three RIT points in the central portion of the measurement range for the level. Similarly, most level tests aim for a SEM of about five RIT points for each goal score to be reported. The desired precision of the overall score can be achieved with 40 items in a level, and the desired precision of the goal scores can be attained with about seven items per goal. Therefore, NWEA ensures that every ALT series contains at least seven items per goal area and at least 40 items overall. It takes students slightly less than one minute, on average, to answer each item, so these tests can be administered in a very time efficient manner. (Note that all MAP and ALT tests are administered without time limits. As long as a student is working productively on a test, they are allowed to continue.)
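
The link between test length and precision can be illustrated with the usual IRT approximation, SEM equal to one over the square root of the test information. Assuming Rasch items targeted near the student (information close to its 0.25 per-item maximum) and a scale of roughly 10 RIT points per logit, about 40 well-targeted items give a SEM near three RIT points. This is a rough illustration under those assumptions, not NWEA’s published derivation.

import math

def approx_sem_in_rit(num_items, info_per_item=0.25, rit_per_logit=10.0):
    # SEM is roughly 1 / sqrt(total information); convert logits to RIT units.
    # info_per_item = 0.25 is the Rasch maximum for a perfectly targeted item.
    return rit_per_logit / math.sqrt(num_items * info_per_item)

print(round(approx_sem_in_rit(40), 1))  # about 3.2 RIT for 40 well-targeted items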

The optimal test design for ALT differs slightly depending on the subject. Within a subject area, however, the same test design is used for almost all NWEA tests. The optimal test design for an ALT series includes:

• Five to eight level tests.

• Eight or fewer major goal areas per test.

• 40-50 items per test.

• At least seven items per goal area.

• 20 RIT point range in item difficulty per test.

• 10 RIT point overlap in item difficulty between adjacent level tests.

MAP Test Specifications

For MAP, test design involves setting the following psychometric specifications:

• Size of the item pool.

• Distribution of item difficulty by the test blueprint.

• Desired length of the test.

• Scoring algorithm.

• Item selection algorithm.

These specifications allow NWEA to predetermine the expected performance of the MAP system in terms of measurement precision before the test is administered to a single student.

Scoring—All MAP assessments employ a common scoring algorithm. During the assessment, a Bayesian scoring algorithm is used to inform item selection. Bayesian scoring for item selection prevents the artificially dramatic fluctuations in student achievement at the beginning of the test that can occur with other scoring algorithms. Although the Bayesian scoring works well as a procedure for selecting items during test administration, Bayesian scores are not appropriate for the calculation of final student achievement scores. This is because Bayesian scoring uses information other than the student’s responses to questions (such as past performance) to calculate the achievement estimate. Since only the student’s performance today should be used to give the student’s current score, a maximum-likelihood algorithm is used to calculate a student’s actual score at the completion of the test.
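
A minimal sketch of maximum-likelihood scoring under a Rasch model, assuming the item difficulties are already expressed on the measurement scale (logits here) and that the response string is mixed (an all-correct or all-incorrect pattern has no finite maximum-likelihood estimate). This illustrates the general technique rather than NWEA’s production algorithm.

import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def maximum_likelihood_score(difficulties, responses, iterations=20):
    # difficulties: logit difficulties of the items administered.
    # responses: 0/1 scores in the same order.
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_p(theta, b) for b in difficulties]
        residual = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta += residual / information  # Newton-Raphson step
    return theta, 1.0 / math.sqrt(information)  # estimate and its standard error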

Item selection—All MAP assessments employ a common item selection algorithm. The item selection algorithm works by initially identifying the ten items within the first content area that provide the most information concerning the student. These items are the items with difficulties closest to the current achievement level estimate for the student. After the ten items are selected, one of the ten is selected at random and administered to the student. By targeting items in this manner, NWEA maximizes the information obtained from each item. This maximizes the efficiency of the assessment while also balancing the usage of the items with similar psychometric characteristics. Once the item is administered, the process repeats itself, except that only items from the second goal area are selected. This continues until an item from all of the major goal areas has been administered, at which point an item from the first goal area is selected once again.
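
A sketch of the "ten closest, then one at random" step described above, cycling through goal areas. The item and pool structures (dictionaries with id, goal, and difficulty fields) are hypothetical; only the selection logic follows the description.

import random

def select_next_item(pool, goal, theta_estimate, used_ids, k=10):
    # Candidates: unused items in the current goal area, ranked by how close
    # their difficulty is to the current achievement estimate.
    candidates = [item for item in pool
                  if item["goal"] == goal and item["id"] not in used_ids]
    candidates.sort(key=lambda item: abs(item["difficulty"] - theta_estimate))
    # Randomly pick one of the k most informative items, which balances the
    # exposure of items with similar psychometric characteristics.
    return random.choice(candidates[:k])

# Usage sketch: rotate through the major goal areas, one item per goal.
# goal = goals[items_administered % len(goals)]
# item = select_next_item(pool, goal, current_estimate, used_ids)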

Item pool structure—To provide each student a challenging test with an accurate score, the MAP system requires the use of large pools of items with a difficulty range appropriate for all of the students being tested. MAP item pools generally contain 1,200-2,400 items. The distribution of item difficulty for a good item pool reflects the distribution of student achievement so that each student being tested is challenged, and therefore measured accurately. MAP item pools have items that challenge virtually every student tested.

Another important factor in item pool design is the distribution of item difficulty in each reported goal area. If there is a gap in the pool of items in a major goal area at a particular difficulty level, students at that achievement level are administered less than optimal items. These items might be slightly easier or more difficult than desired. This reduces the amount of information obtained from each item, thus reducing the test’s efficiency and precision. To maintain consistency across goal areas and difficulty levels, the MAP system contains a pool sculpting tool that maximizes the fit of the item pool to measurement and reporting needs.

Test length—As in ALT, the test length for MAP is determined during test design. The desired precision of the assessment is the most important factor in determining test length. Due to its adaptive nature, MAP tends to be slightly more efficient for most students than ALT. This combination of efficiency and flexibility allows for a fairly standardized test length for MAP. Generally speaking, MAP assessments that include goal score reporting contain 40 to 50 items.

Content Definition

One of the most important steps in developing a high-quality assessment is to make sure that the content of the assessment is a valid representation of the content that is to be taught in the classroom. There are two aspects to defining the content for an educational achievement test. The first aspect is defining the curriculum, which is a complete description of the content taught in the classroom. Almost all NWEA agencies have a pre-existing documented curriculum that can be easily used in the test development process.

Having identified the content to be included in instruction, the next step is to identify how it is sampled for the test. This is done by detailing the goals and objectives that make up the span of content that could be included in a reasonable assessment. Figure 11 shows an example of a goals and objectives document for a reading test aligned to the NWEA Learning Continuum.

Figure 11: A sample goals and objectives document for a reading test

1. Word Meaning
   A. Use context clues
   B. Use synonyms, antonyms, and homonyms
   C. Use component structure
   D. Interpret multiple meanings

2. Literal Comprehension
   A. Recall details
   B. Interpret directions
   C. Sequence details
   D. Classify facts
   E. Identify main idea

3. Inferential Comprehension
   A. Draw inferences
   B. Recognize cause and effect
   C. Predict events
   D. Summarize and synthesize

4. Evaluative Comprehension
   A. Distinguish fact and opinion
   B. Recognize elements of persuasion
   C. Evaluate validity and point of view
   D. Evaluate conclusions
   E. Detect bias and assumptions

No assessment measures all elements of a curriculum, nor should an assessment try. A good curriculum contains a mix of elements that may be best assessed by formal assessments, classroom observation, evaluation of major projects, and a variety of other methods. For example, listening skills are assessed only clumsily by formal assessments. It is necessary for educational agencies to identify the specific elements of the curriculum that the test will include.

To accomplish this, the agency needs to specify the percentage of test items to be included on the test from each curricular goal and sub-goal. Most MAP and ALT assessments have about four to eight goals with five or six sub-goals each, and contain between 40 and 50 items. Some educational agencies base the distribution on criteria such as the emphasis of certain content within the overall curriculum. This creates the test blueprint to be used for item selection. Like the blueprint that an architect follows, the specification of goals and objectives provides a plan that guides the rest of the test development process.


Item Selection

The next step in the test development process is to select the items to appear on the assessments. During this step, the test blueprint and test design are used as a guide to select the items from the item banks that comprise the test or item pool.

ALT—To select items for ALT, item selection typically occurs at a single meeting. During this meeting, teachers and other stakeholders are instructed on the principles that guide item selection. The group reviews the items in the item bank and hand-picks the items individually for each level test of the ALT series. By conducting item selection in this manner, organizations can customize the assessment content to local requirements.

The principles that guide ALT item selection are fairly straightforward. Each level test of the ALT series must have a set of items with a particular range and distribution of difficulty. In addition, the test blueprint needs to be followed precisely. Each level test must contain at least seven items per major goal. In an effort to provide even greater content specificity, organizations are instructed to include a wide variety of sub-goals as well. Lastly, the items are reviewed to ensure that there are no questions that give away information needed to answer other questions.

MAP—Item selection for MAP is quite different. Rather than selecting the individual items to comprise a test, it is necessary to construct a pool of items that the computer selects from during the test administration. MAP item pools can have between 1,200 and 2,400 active items. Since these pools are so large, it is more efficient to select items using automated processes.

During item selection, NWEA staff members work with an agency’s test blueprint to select the index categories that best match the test blueprint. Staff members familiar with the indexes and the item banks review each of the selected index categories and assign each index to a major goal of the test blueprint. From there, all items associated with the index in this manner become candidates for use. Depending on the number of items in this candidate population, all items are packaged for use, or the population is reduced through the use of a pool sculpting tool that is designed to create a subset of the population of items that have optimal measurement characteristics and fit to the test blueprint.


Test Construction

The final phase of the test development process is to construct the tests. The purpose of this phase is to package the items into deliverable tests, and where necessary, set some of the test specifications that were determined during test design. In addition to constructing the tests, the tests are reviewed as a step in quality control. Although the actual test construction is executed entirely by NWEA staff, the review process is a joint effort between NWEA staff and the NWEA partner agency.

ALT—For ALT, two documents are created during test construction. The first is the actual test form. This document contains an introduction to the test, the instructions for the test, the test identification information, and the text and response options for each of the items. The second product created during ALT test construction is the series of files, called TPS files, that contain the information necessary for the NWEA Scoring and Reporting Software (SRS) to function.

In creating the test form, the items chosen during item selection are placed into the sequence in which they will appear. This initial draft has the items that will appear on the test and additional information that will not appear on the final form, such as the correct answer, the item difficulty, and the item identification number. ALT forms are arranged so that the easiest items appear toward the beginning of the test and the hardest items appear toward the end.

Once this initial draft is created, it is reviewed for content and localization by the educational agency. The content review ensures that all important content is included and well balanced in the assessment. It also allows the agency to assure that names of geographic references and people on the tests are appropriate for their students. During this stage, final item substitutions are made to correct any problems identified.

Once the initial draft has been reviewed and approved, a print master proof is produced. This master is reviewed by both NWEA staff members and the educational agency to ensure that all graphics and formatting issues are resolved, such as widows and orphans, pagination, textual font, and margins. Once the test form is ready, a final print master of the test form is produced.

The TPS file is the other document produced for each test, and its development parallels the development of the test form. At each stage in the process, a TPS file is produced that matches the current version of the test form. This file contains the item identifiers, the answer key to each of the items, the names of the goals and objectives, and the unique goal to which each item is associated. In addition, the raw-score-to-RIT-score conversion table is included.

MAP—For MAP, the test construction phase entails the creation and packaging of several computer files that are used by the test administration software during the test event. Among these is the database that includes the item pool with the text, response options, answer key, and calibrated item difficulty for each of the items in the pool. There are also files containing the test specifications such as the number of items to be administered, the blueprint, bitmap graphics, audio files, reading passage files, and the on-screen calculator.

Once the test files are created, NWEA staff members check the test by taking three sample tests: one simulating high performance, one simulating low performance, and one simulating average performance. During these sample tests, a thorough inspection of the tests’ functionality occurs. The item selection algorithm is checked, the scoring routine is checked, and the goal scoring routine is checked. In addition, the data being collected during the test are examined for completeness and accuracy. Upon completion of the reviews, the test construction phase is complete and the test is ready to be administered.

Test Administration

NWEA helps each agency administer the assessments in a manner that is efficient, psychometrically sound, and fair to each student. Although complete details about the administration of NWEA assessments can be found in the MAP administration training materials, an overview of the technical aspects of the administration process is provided next.

ALT—Prior to test administration, the appropriate level test form for each student is identified to ensure that the test is appropriately challenging and informative. If the test is too easy, students will be bored and will not demonstrate what they know because the test questions are not difficult enough. If the test is too hard, students will be frustrated, and again, will not demonstrate what they know because the test questions are too difficult.

To identify a challenging test for a student who has previously taken NWEA assessments, the SRS software is used. Some districts choose to conduct this process themselves, while others rely on the expertise and resources of NWEA to conduct this step on their behalf. The SRS software uses a student’s valid test scores from the previous three years in the assessed subject to identify the predicted score for the upcoming test with the following procedure (a sketch of the calculation appears after the steps):

1. The test score from each previous term and year is converted to a standardized score. This score identifies how far above or below the mean a student scored for the term and grade in which the test was taken.

2. The most recent of these standardized scores is duplicated so that it has twice the weight of any other score.

3. The average of all of these standardized scores is calculated. This averaged standardized score tells us how many standard deviations the student ranks above or below the district average over the last few tests.

4. In order to obtain the predicted score for the current year, the average standardized score is multiplied by the standard deviation of scores in the student’s current grade. This result is added to the average score for the student’s current grade. This provides a predicted score for the student that is as much above or below the average district performance as the student’s past performances.
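
The four steps above reduce to a short calculation. In the sketch below the inputs (parallel lists of past scores with their term-and-grade means and standard deviations, ordered oldest to newest) are hypothetical placeholders for the values SRS actually looks up.

def predicted_score(past_scores, past_means, past_sds, current_mean, current_sd):
    # Step 1: standardize each past score against its own term/grade distribution.
    z_scores = [(score - mean) / sd
                for score, mean, sd in zip(past_scores, past_means, past_sds)]
    # Step 2: give the most recent score twice the weight of any other score.
    z_scores.append(z_scores[-1])
    # Step 3: average the standardized scores.
    average_z = sum(z_scores) / len(z_scores)
    # Step 4: project that average onto the current grade's distribution.
    return current_mean + average_z * current_sd

print(predicted_score([205, 212], [200, 208], [10, 10], 215, 10))  # about 219.3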


Once the predicted score is obtained for the student, SRS selects the level test that is closest in mean difficulty to the student’s predicted score. This information is printed on an answer sheet for the student. If a student’s teacher has a strong reason to believe that the student has changed markedly in achievement since the last test, then he or she may wish to administer a different level test.

If a student is being administered her or his first ALT test in a given subject, then a short locator test must be administered to determine which level test to provide to the student. A locator test is a short wide-range test (containing 16 items) that determines which level test is appropriate for a student.

MAP—Since MAP is able to provide a test that is consistently challenging to all students regardless of their achievement level, agencies need not concern themselves with selecting the appropriate test form for each student. The MAP system selects the first item to be given to a student based on the student’s past performance or the grade level in which the student is enrolled. Specifically, the first item is selected so that it is 5 RIT points below the student’s previous performance. If no previous score is available, the first item is selected so that it is 5 RIT points below the mean performance for students in the same grade from the norming study. If no grade level mean is available, the difficulty of the first item is set to an arbitrary value based on the subject and grade.
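
The starting rule reduces to a simple fallback chain. A sketch; the final default value is illustrative only, standing in for the subject-and-grade value the text mentions.

def starting_item_difficulty(previous_score=None, grade_norm_mean=None, default=200):
    # First choice: 5 RIT below the student's previous score in the subject.
    if previous_score is not None:
        return previous_score - 5
    # Second choice: 5 RIT below the norming-study mean for the student's grade.
    if grade_norm_mean is not None:
        return grade_norm_mean - 5
    # Last resort: an arbitrary subject/grade value (placeholder here).
    return default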

The Testing Environment

Test proctors conduct test administration and do so only after they have been trained by their assessment coordinator or NWEA staff member. Classroom teachers are generally the proctors for ALT. For MAP, however, the proctor is typically not the classroom teacher. The proctor can be a technical assistant, educational assistant, or some other staff member assigned and trained for that task. NWEA encourages the classroom teacher to be in the lab with the proctor.

Trained proctors have four tasks to perform when administering a test:

• Ensure that the testing environment meets administration standards.

• For ALT, provide each student with the appropriate test and answer sheet.

• For MAP, ensure that each student is at the correct computer.

• Provide the test instructions to the students.

• Monitor students while they take the test.

The first task that test proctors perform as part of test administration is to ensure that the testing environment meets administration standards. The proper testing environment helps to ensure that the test-taking experience is consistent and the results are fair and accurate. Since each educational agency has different facilities and different resources available for testing, the testing environments in each agency differ as well. By enabling proctors to make the testing environment secure, comfortable, and free from distractions and inappropriate resources such as text books, notes, and instructional posters, the setting provides all students with an equal opportunity to perform on the assessment.

In order to make the assessments fair for all students, those students with an Individualized Education Plan (IEP) may be granted special modifications that should be planned for ahead of time. There are six types of modifications that can be granted to students with an IEP:

• Changes in the timing or scheduling of the assessment.

• Changes in how the test directions are presented.

• Changes in how the test questions are presented.

• Changes in how the student responds to test questions.

• Changes in the test setting.

• Changes in the references and tools that are provided during the test.

Proctors also provide the test instructions to students and monitor the students as they complete the test. Proctors are provided with specific test instructions that they read to the students before taking the test, modifying the text only when it does not pertain to either the test takers or the testing environment.

Once testing begins, the proctor monitors students to ensure that they work independently, do not disturb each other, and do not attempt to cheat on the test. Proctors are explicitly instructed not to help students with any problems or to read problems to students unless that is part of a student’s IEP testing modification.

Proctors are instructed to invalidate the test if a student:

• Copies or receives verbal help from another student.

• Answers randomly without reading the questions.

• Refuses to take or continue the test.

• Seems unable to comprehend directions or questions.

• Exhibits disabling anxiety.

There are three administration specifications that are worthy of note. First, there is no time limit placed on students for the administration of the assessment regardless of whether it is an ALT or a MAP test. Second, students are not permitted to skip questions on MAP tests, nor can they return to earlier questions. Last, the MAP system tracks the items that students answer. Each time that a student takes the exam, the MAP system ensures that only fresh items, items never taken previously by the student, are administered. The MAP system allows up to four test administrations per student per year per test.

During the administration of a MAP test, the test proctors are responsible for navigating the computer through test start-up, student breaks, and unexpected system failures. Before a student sits down to take the test, the testing district or agency must submit a Class Roster File (CRF) to NWEA with all of the pertinent identifying information for each student. This includes the student IDs, names, and grade levels. Once these data are processed by NWEA, they are transmitted to the test sites and proctors can schedule tests for students. When students are ready to test, proctors must log into each student’s workstation and load the test. This entails locating the student name in the CRF and initiating the test session. Once the start-up screen appears, the student can begin testing.

Students normally complete each test without interruption. However, if a student needs to take a break, he or she must raise his or her hand and indicate this to the test proctor. Proctors then pause the test at the student’s workstation. Upon completing the break, the proctor resumes the test for the student. When the test is resumed, the student is presented with a different item than the one that was displayed on-screen prior to the break. This is done to prevent students from looking up answers while taking their break.

In the event of a system failure, such as a power failure, the proctor can resume tests that were in progress prior to the failure. The MAP system is designed in such a way that data are written to the test record following the completion of every item. This design means that no item responses are lost. When a proctor restarts the system and the student’s test session, the proctor is automatically given the option to resume the student’s test where the student left off or to start a new test. This decision is left up to the test proctor, who may choose to initiate a new test if the student was only at the beginning of an assessment or may choose to resume the test if the student was well into the test.

Sometimes a student may not complete the assessment during a scheduled test session. For example, a student may become ill and need to leave school. The proctor can terminate the test with the option to resume the test later. If the proctor chooses not to make the test resumable, the terminated test is invalidated and a new test is generated.

Once a student completes a MAP assessment, the student’s overall RIT score and goal scores are displayed on the screen. The proctor may print these scores if he or she chooses, but this is not necessary since the scores appear later within various reports. After the student reviews the scores, the MAP system advances to the next test assigned to this student or to the student selection screen.

Several different mechanisms are in place to ensure the technical quality of the MAP administration. During training, proctors are instructed to complete an Item Report Form whenever they encounter a questionable item. These reports are monitored and adjustments are made to the system as needed. Upon any modifications to the administrative software or systems, tests are repackaged and go through the quality assurance process described above. This process includes thorough testing of the scoring algorithms and item selection algorithms.

Although test administration facilities may differ slightly from one agency to another, standardization of the test administration procedures is simple and straightforward. NWEA provides test proctors with the training and resources necessary to administer the MAP and ALT assessments in a manner that promotes optimal performance and is fair to all students taking the assessments.


Test Scoring and Score Validation

Test scoring and score validation follow test administration. It is the responsibility of NWEA and the educational agency to ensure that test scores are accurate. Since scoring and validation are done using computerized software, these systems have been rigorously tested to ensure that scores rendered from the system are calculated and displayed correctly.

As with other IRT-based testing systems, the scoring procedures are somewhat different than those in more traditional testing systems. The proportion correct on a given test does not provide useful information. Rather, a computer calculates the students’ scores taking into account not only the students’ performance on the test items but also the difficulty of the items that each student is administered. In a manner similar to a diving or gymnastics competition, the student’s score is based on the difficulty of the items attempted and the student’s performance on each item.

Achievement Level Tests (ALT)

For ALT, students’ answer sheets are scanned and scored using the SRS software. Some districts choose to conduct this process themselves, while others rely on NWEA to conduct this procedure.

Before scoring the tests, the test identification information on the answer sheets is validated. The agency administering the test is notified of any incomplete or erroneous student identifiers, student names, student grade levels, or ALT test levels. After these errors are corrected, the answer sheets are scanned and scored.

Today’s scanning technology is accurate. Apart from occasions when a student fails to shade in a response darkly enough for the scanner to detect it, the scanning equipment converts the answer sheet data into digital data very accurately. The only time that human intervention is needed in the scanning process is when a student has obviously filled in the wrong section of the answer sheet, in which case the test data are transferred to the appropriate section.

The scoring algorithm contained within the SRS is also quite accurate. This algorithm compares student responses to the key for the items and identifies whether the student answered the item correctly or incorrectly. Whenever updates to the SRS are made, sample tests are scored using the new system. The accuracy of the new system is verified when these scores are compared to known scores.

NWEA has additional procedures that can validate the key for each item. These procedures compare each item’s known statistics to the statistics that are calculated from the data being scored. Any items with incongruent statistics are investigated to assure that the key for the item is coded correctly. Any items with invalid keys are rescored using the correct key.

After the tests are scored, the results are validated. A student’s ALT score is invalidated if:


• The percentage of items that the student answered correctly is less than or equal to the percentage that would be obtained by chance guessing plus five percent (25% for mathematics and science and 30% for reading and language usage).

• The percentage of items that the student answered correctly is greater than or equal to 95%.

• The SEM is greater than 5.3 RIT points.

• The student omitted more than half of the items.
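
The rules above translate directly into a set of checks. The thresholds in the sketch are taken from the list (the chance-plus-five-percent floors of 25% and 30%, the 95% ceiling, the 5.3 RIT SEM cap, and the half-omitted rule); the function signature itself is hypothetical.

def alt_score_is_valid(percent_correct, percent_omitted, sem, subject):
    # Chance + 5%: 25% for mathematics and science, 30% for reading and
    # language usage, per the validation rules listed above.
    chance_floor = 30.0 if subject in ("reading", "language usage") else 25.0
    if percent_correct <= chance_floor:
        return False
    if percent_correct >= 95.0:
        return False
    if sem > 5.3:
        return False
    if percent_omitted > 50.0:
        return False
    return True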

All students who receive invalid test results are flagged for retesting. The SRS produces a list of these students. Agencies are instructed to provide these students with a different test from the ALT test series. If students have underachieved on the test they were provided, they are administered a test two levels below their first one. Students who overachieve are provided with a test two levels above their first one. Agencies typically test any students who were absent during the initial testing along with the students who require retesting. Once all students have been tested or retested, the scoring process for these students happens as it did the first time around.

To summarize, ALT scoring and validation is a three-stage process. First, the student and test identifying information is validated. Second, the test data are scanned and scored. Third, the test scores are validated. Any students who have invalid test scores are then retested and their test results are scored. Once all tests have been scored, reports can be generated.

Measures of Academic Progress (MAP)

For MAP, student scores are calculated by the computer during test administration. In addition, the computer rescores the test event to double-check its correctness immediately following the assessment. Shortly after the assessment, the results are transmitted back to the central MAP database for score validation. Score validation also happens during this data collection process. A MAP score is invalidated if:

• The student completed the test in less than six minutes.

• The SEM is greater than 5.5 RIT (unless the student’s score is greater than 240 RIT).

• The SEM is less than 1.0 RIT, which is an indicator of some kind of technical difficulty with the assessment.

In addition to calculating the students’ overall scores, both MAP and ALT assessments provide additional score information for each assessment:

• RIT Score—A RIT score is an objective indicator of a student’s overall achievement level in a particular subject. Although theoretical RIT scores range in value from negative infinity to positive infinity, typical scores fall between 150 and 300. RIT scores are equal interval in nature, meaning that the distance between 150 RITs and 151 RITs is the same as the distance between 230 RITs and 231 RITs.


• SEM—The SEM is a measure of the accuracy or precision of an assessment. Assessments that are more accurate have a smaller SEM. The SEM of an NWEA assessment is calculated using maximum-likelihood procedures. Although the SEM can theoretically range from zero to infinity, typical values fall between 2.5 and 3.5 RIT points for an ALT or MAP test.

• RIT Range—The SEM is used to calculate what a student’s expected score would be on repeated testing. The range of expected scores is called the RIT range. If the SEM of a score is three RIT, then the student has a 68% chance of scoring within +/- three RIT points of his or her RIT score (a worked example follows this list).

• Goal Performance—A student’s achievement level in each of the goal areas of the test is calculated. This is done using only the student’s performance on items from a single goal area. Since there are so few items administered in a single goal (approximately seven items per goal), goal scores have a relatively high SEM. It is for this reason that there are only three possible goal scores: HI, AV, and LO. Goal performance of LO means that the student is performing at the 33rd percentile or lower. Goal performance of AV means that the student is performing between the 34th and 66th percentile. Goal performance of HI means that the student is performing at or above the 67th percentile.

• Percentile Rank—The percentile rank is a normative statistic that indicates how well a student performed in comparison to the students in the norm group. The most recent norm sample was a group of approximately 1,000,000 students from across the United States. A student’s percentile rank indicates that the student scored as well as, or better than, that percentage of the students in the norm group.

• Percentile Range—The percentile range includes the percentile ranks that are included in the RIT range. As a result, there is a 68% probability that a student’s percentile ranking would fall within this range if the student tested again relatively soon. The percentile range is often asymmetric because percentile ranks are not an equal interval measure of performance.

• Lexile Score—This score is only provided on reading assessments. The Lexile score is an assessment of the student’s performance on the Metametrics Lexile framework scale that can be used to assist selection of appropriate reading materials for the student. Lexile scores are calculated directly from the student’s RIT score using a transformation identified in a series of research studies. More information on the Lexile framework can be found at www.lexile.com.
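
As a worked illustration of the quantities above (the function names are hypothetical, and the cut-offs simply restate the text; this is not NWEA's reporting code):

```python
# Illustrative calculations for the RIT range and goal performance categories
# described above.
def rit_range(rit, sem):
    """The 68% range is one SEM on either side of the RIT score."""
    return (rit - sem, rit + sem)


def goal_performance(goal_percentile):
    """LO at or below the 33rd percentile, AV from the 34th to the 66th, HI at or above the 67th."""
    if goal_percentile <= 33:
        return "LO"
    if goal_percentile < 67:
        return "AV"
    return "HI"


# A RIT score of 210 with an SEM of 3 gives a RIT range of 207 to 213,
# and a goal-area percentile of 45 falls in the AV category.
print(rit_range(210, 3))      # (207, 213)
print(goal_performance(45))   # AV
```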


Report Generation

Achievement Level Tests (ALT)

Once ALT test data are scored and validated, SRS can generate reports for a teacher, school, or district that summarize the data for an individual student, class, school, or district. A complete description of the ALT reports that are available is found in the Scoring and Reporting System User’s Manual (NWEA, 1996).

The completeness and accuracy of the reports are largely dependent on the data provided by NWEA agencies through the Student Master Files (SMF). These files tell the SRS which students are in each class, school, and district. Most errors produced by the SRS stem from errors in the information provided on the SMF and the student answer sheets. Whenever an agency reports an error, SMF information can be reprocessed and reports can be reproduced. In order to minimize these types of errors, each test site is provided with a copy of the SMF specifications and given training in their use.

To ensure the confidentiality of the reports and all test-related information, NWEA uses reputable couriers (Federal Express and UPS) and tracks all documents that are shipped to agencies.

Measures of Academic Progress (MAP)

Once MAP test data have been scored and validated, educators can access the results through NWEA’s secure reports website. Prior to the close of a testing season, teachers can view the results of the assessments for all students in their class through the Teacher Report. After the testing season closes, teachers can view detailed information about each student on the Individual Student Progress Report. Teachers can also view information on every student in their class in the Class Report. Administrators can view information on students, classes, grades, or schools by ordering a Class Report, Grade Report, or District Summary Report.

As with the SMF process, the accuracy of MAP reports is largely dependent on the data provided by the agency administering the tests. The MAP system uses the Class Roster File (CRF) for student class information and the Special Programs File (SPF) to identify all of the special programs in which each student participates. Each MAP agency is trained concerning the details of completing these files. Additionally, files coming from the MAP agencies are inspected upon receipt. Although these validations prevent many errors, there are still students who are listed in the wrong class or school. When this occurs, NWEA works with the agency to correct the error.

In order to ensure the confidentiality of the MAP reports, the reports website has a robust authentication system. MAP Coordinators are provided with a six-character login and password that they must use in order to enter the system. The reports website is secured via 128-bit encryption using secure socket layering. This level of security is equivalent to the level provided by most financial and medical institutions in the United States.


Results Interpretation

“The test developer should set forth clearly how the test scores are intended to be interpreted and used.” – Standard 1.2, APA/AERA/NCME Standards (1999).

In order to be useful, the results of assessments must be interpreted appropriately. In fact, one way that NWEA’s mission, “Partnering to help all kids learn,” is fulfilled is by helping member agencies interpret the reports well and use the test data to provide a better educational experience for all students. Some of the training elements and documentation that enhance the appropriate use of data are:

• A workshop series on the assessment process, from administration to longitudinal data use.

• On-line resources including the NWEA Learning Continuum, technical documentation, and research briefs.

• Publications detailing administration procedures, modifications and accommodations, and annotated reports with interpretation guidelines.

• Research, test administration, and software support (on site, telephone, and e-mail).

NWEA conducts over 800 workshops each year to train educational agencies to interpret test scores appropriately and to help these agencies make use of the information to improve each student’s education. Each of these workshops is designed to meet the needs of a specific audience. Table 1 details three workshops concerning the use of assessment data.


Table 1. NWEA data assessment workshops

Stepping Stones to Using Data
  Length: 1 day
  Audience: Educators
  Topics covered:
    • Understanding test results statistics
    • Recognizing diversity in the achievement of students within a classroom
    • Using test results to develop flexible classroom groupings
    • Introduction to the Lexile Scale
    • Introduction to the NWEA Learning Continuum
    • Conferencing with students and parents about test results

Climbing the Data Ladder
  Length: 2 days
  Audience: Educators who have attended the Stepping Stones to Using Data workshop and have one fall and spring of test data
  Topics covered:
    • Using Lexile scores
    • Relating test results to the NWEA Learning Continuum (2)
    • Understanding growth patterns
    • Setting academic goals with students and parents
    • Teaching with flexible classroom groupings
    • Relating test results to state standards
    • Using data to guide instructional practice

Leading with Data
  Length: 2 days
  Audience: District Superintendents, Principals, Curriculum Coordinators, Assessment Coordinators
  Topics covered:
    • Using test results as a measure of student growth
    • Using growth data for program evaluation
    • Using Leader’s Edge: Growth Analysis Tools
    • Using growth data for school improvement

A set of on-line resources designed to help educators interpret the reports is also readily available. The MAP reports website contains annotated sample reports that provide concise definitions of the information provided on each report in the MAP series. Similar reports are provided in paper form to districts using ALT reports generated from the SRS system. In addition, the document RIT Scale Norms (NWEA, 2002) outlines the normal performance of students taking MAP and ALT tests. The information in this document is fundamental to using percentile scores appropriately.

Finally, one person at each test site is trained to be a contact person. Contacts are trained to provide on-site report interpretation support to teachers, parents, students, and other stakeholders. If the test site contact is unavailable or unable to answer questions, NWEA staff members are available via phone or e-mail.


Customizing Assessments

One criticism of large-scale standardized tests is that they assess achievement from a regional, state, or national perspective without capturing the nuances of each local educational system. Noting this weakness, NWEA designed a testing system that customizes the content of assessments to improve the pertinence of the results to local educators. Still, NWEA’s testing system has always kept a keen eye on its global capabilities as well. MAP and ALT are designed to provide results that are both locally applicable and globally comparable. This section details how this is accomplished.

Localizing NWEA Assessments

There are two primary ways in which agencies have the opportunity to make the assessment locally applicable. They can define the content to meet their needs, and, for those agencies that desire an extremely high degree of customization, they can actually write test items for the item banks. By defining the content to meet their needs, each educational agency makes the assessment content more locally pertinent and facilitates reports that are aligned with the structure and language of their local curriculum standards.

When defining the content of the assessment, agencies have the opportunity to undertake three levels of localization:

• They can use the tests aligned to the NWEA Learning Continuum.

• They can use a test that has been aligned with their state’s standards.

• They can develop a test based entirely on their local curriculum.

The NWEA-Learning-Continuum-aligned test is a logical synthesis of the goals and objectives of a cross section of educational agencies. While this test is a good synthesis of existing content standards, it is the test that is the least localized.

Agencies wishing to choose an assessment customized more to their local needs may wish to use a state-aligned assessment. At this point, NWEA has constructed eight different state-aligned ALT assessments and 48 different state-aligned MAP assessments. The content for each of these assessments was designed following the same procedure. The published content standards established by each State Department of Education were reviewed by NWEA staff familiar with the item banks and the item indexes.

In creating the goals and objectives for the state-aligned assessment, each major goal was titled as one of the individual content standards or a logical combination of two or more standards. Each of the sub-goals within a major goal was titled from the standards within each of the respective content standards. From there, the blueprint was created by evenly distributing the content across each major goal.

Item selection was conducted by using the item indexes to select the items from the bank that most clearly reflected the content standards for the state, after which each item index was mapped to a single major goal. By developing the test in this fashion, the content of a state-aligned assessment is solidly aligned with the content standards for the state. Once constructed and administered, it is capable of providing feedback on each student that is useful for all educators in helping to monitor student growth toward the state’s standards.

Sometimes a local educational agency may wish to construct an assessment that is totally customized to meet its own needs. In this situation, a unique goals and objectives document, a unique test blueprint, and a unique test design are created. The agency may also wish to be involved in item selection. If so, they may either select the individual items themselves (for ALT) or work with the item indexes to select items from the item banks to construct the item pools (for MAP).

By providing these three levels of content customization, NWEA can assist agencies with a variety of different needs. Agencies wishing to implement a solution quickly can choose either the NWEA-Learning-Continuum-aligned assessment or the state-aligned assessment and have confidence that the results will be valid, reliable, and useful to them in many different ways. On the other hand, agencies with specialized needs may further localize the assessment.

Some agencies may desire an even higher level of customization. These agencies may write additional test items to fill perceived gaps and to provide coverage of specific topics. The development process for these items is identical to the development process normally followed to add items to the item banks. Like all items that appear on MAP and ALT assessments, these items are calibrated to the underlying measurement scales using the psychometric techniques outlined earlier. Agencies choosing to write their own items are limited only by their capacity when customizing assessments for their local needs.

Making Global Use of Localized Assessments

As mentioned earlier in this section, one of the fundamental benefits of using NWEA assessments is that test results are globally comparable. That is, the results of any two assessments of the same subject are directly comparable. These results can be from the same student at two different testing occasions or from two different students in two different educational settings.

You may wonder how this is possible considering the amount of effort invested in making the assessments locally applicable. The psychometric ingenuity that makes this possible stems from the use of IRT as the framework for the development of the NWEA assessments and scales. An explanation of how this framework and the NWEA assessment system promote global comparisons of locally applicable test results may be useful to those interested in using test results to make comparisons of any kind, especially comparisons that reach beyond a single educational setting. Before proceeding, you may wish to review the section The Measurement Model, which describes the IRT framework.

A simplistic approach to global comparison is to administer the exact same test to every student. This is not appropriate, considering the wide range of student achievement levels, the wide variety of content taught and assessed worldwide, and the small amount of time allowed for test administration. To overcome this situation, the testing community developed a method whereby only a sample of appropriate items needs to be administered to each student in order to infer the student’s overall achievement level. This method hinges on being able to place the difficulty of every test item and the achievement level of every student directly on a common scale.

The IRT framework allows for the creation and maintenance of the scale. It also makes it possible to calculate the difficulty of new test items on the scale. Since each item is field tested on a wide variety of students, the difficulty of the item on the NWEA scale is a global value. The item difficulty is applicable to all students regardless of age, grade, achievement level, or curriculum. In turn, the responses of all students to an individual item are directly comparable.

Extending this, the results of assessments containing any combination of items with difficulty estimates on the common scale are also directly comparable. Since all NWEA assessments contain items with difficulty estimates on the common scale, the results of these assessments are also directly comparable. By having only one IRT measurement scale for each subject area, NWEA enables districts to construct exams using any combination of items that suits their local needs while still having the capacity to compare the results globally.
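
The sketch below makes this concrete: under a Rasch-type model, a maximum-likelihood achievement estimate can be computed from any subset of calibrated items, and the estimate lands on the same underlying scale regardless of which items were administered. The item difficulties and the logit metric used here are illustrative assumptions, not values from the NWEA banks.

```python
# Illustrative sketch: two students taking different calibrated item sets still
# receive achievement estimates on the same underlying scale.
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ml_achievement(difficulties, responses, theta=0.0, iterations=25):
    """Newton-Raphson maximum-likelihood estimate of achievement (in logits)."""
    for _ in range(iterations):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        if information == 0.0:
            break
        theta += gradient / information
    return theta


# Made-up difficulties for two different item subsets drawn from one calibrated bank.
easier_items = [-1.5, -1.0, -0.5, 0.0, 0.5]
harder_items = [0.0, 0.5, 1.0, 1.5, 2.0]
print(round(ml_achievement(easier_items, [1, 1, 1, 1, 0]), 2))
print(round(ml_achievement(harder_items, [1, 1, 0, 0, 0]), 2))
```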

The concepts explained above apply to the NWEA Learning Continuum as well. The continuum was constructed by determining the average difficulty of the items pertaining to various content topics. As explained above, the difficulty of each item is a global difficulty, thereby allowing the average difficulty of items pertaining to a specific content area to be globally applicable.

In summary, agencies using MAP and ALT create an assessment system containing locally applicable content, test results, and reports by engaging in content definition, item selection, and, sometimes, the item development process. By using IRT models in conjunction with sound field testing processes, these tests provide globally comparable test results and a globally useful learning continuum.


Operational Characteristics of the Assessments

Each NWEA assessment conforms to the tenets outlined within this document. There are four basic subject areas: mathematics, reading, language usage, and science. Within science, there are two different assessments: one is called Science Concepts and Processes and the other is called Science Topics (Life, Earth/Space, and Physical Science). The following tables list the optimal test specifications for each test subject and the optimal difficulty range for each subject.

Table 2: Common ALT Specifications by Subject

  Specification                                            Mathematics     Reading         Language usage  Science
  Number of major goals                                    7               6               6               6
  Number of different ALT levels                           8               8               7               5
  Number of items per test                                 50              40              40              40
  Number of response options per item                      5               4               4               5
  Size of the difficulty range of each level               20 RIT points   20 RIT points   20 RIT points   20 RIT points
  Size of the difficulty overlap between adjacent levels   10 RIT points   10 RIT points   10 RIT points   10 RIT points
  Administration time                                      Untimed         Untimed         Untimed         Untimed

Table 3: The difficulty range of the test items in each level of a common ALT test

  Subject          Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8
  Mathematics      Min-180   170-190   180-200   190-210   200-220   210-230   220-240   230-max
  Reading          Min-170   160-180   170-190   180-200   190-210   200-220   210-230   220-max
  Language usage   Min-180   170-190   180-200   190-210   200-220   210-230   220-max   --
  Science          Min-191   181-200   191-210   201-220   211-max   --        --        --


Table 4: Common MAP Specifications

  Specification                               Mathematics   Reading   Language usage   Science
  Number of major goals                       7             6         6                6
  Number of response options per item         5             4         4                5
  Number of items per test                    50            40        40               40
  Minimum number of items per goal            7             7         7                7
  Minimum number of items in test pool        1500          1200      1200             1200
  Minimum number of items in test pool
    per goal                                  200           200       200              200

  Difficulty of initial item:      5 RIT points below the previous score, or 5 RIT points below the grade level mean
  Ability estimation:              Bayes for item selection and maximum likelihood for the final ability estimate
  Item selection algorithm:        Ordered cycle of goal areas; select the 10 items from the chosen goal area that
                                   maximize test information; select one of the 10 items at random (see the sketch
                                   after this table)
  Administration time:             The test is untimed, but most schools schedule 75-minute blocks
  Number of tests allowed by a
    single student in a given
    school year:                   4
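
The item selection rule in the table (cycle through the goal areas in a fixed order, find the ten most informative unused items in the current goal area, then pick one of the ten at random) might be sketched as follows. The data structures and the Rasch information function are illustrative assumptions rather than the operational MAP code.

```python
# Illustrative sketch of the MAP item selection rule described in Table 4.
import math
import random


def rasch_information(theta, b):
    """Item information under a 1PL model is p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def next_map_item(pool, goal_order, items_given_so_far, theta):
    """Cycle goal areas in order, then pick at random from the ten most informative
    unused items in that goal area (the randomization also limits item exposure)."""
    goal = goal_order[len(items_given_so_far) % len(goal_order)]
    candidates = [item for item in pool
                  if item["goal"] == goal and item["id"] not in items_given_so_far]
    ten_best = sorted(candidates,
                      key=lambda item: rasch_information(theta, item["difficulty"]),
                      reverse=True)[:10]
    return random.choice(ten_best)
```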

In addition to the typical NWEA tests, NWEA has constructed a number of other tests from the same item banks. A description of each test follows.

The Survey Test is a 20-item fixed-length, adaptive assessment that is administered on the computer via the MAP system. NWEA offers the survey test in each of the four subject areas aligned with the state’s curriculum. Generally, these tests take students about 20-30 minutes to complete. This test is designed to provide a quick, overall assessment of a student’s achievement level and does not provide goal scores.

NWEA also offers five different end-of-course assessments in mathematics. These assessments include Algebra I, Algebra II, Geometry, Integrated Math I, and Integrated Math II. Each of these assessments is available in MAP or ALT. Each test is aligned with the NWEA goal structure and contains 40 items. These assessments are designed to measure a student’s overall achievement at the end of a mathematics class and also provide goal area information.


Validity

“Validity is the most fundamental consideration in developing and evaluating tests.” —Chapter 1, Standards for Educational and Psychological Testing (APA/AERA/NCME, 1999).

Validity was defined earlier as the degree to which an educational assessment measures what it intends to measure. From the development process, it is clear that MAP and ALT contain content that is appropriate to measure the achievement level and growth of students in the subjects of mathematics, reading, language usage, and science. It is also important to present a body of evidence that indicates that the scores from MAP and ALT can be used to make accurate statements about student capabilities and growth. In order to support this claim, NWEA provides the following forms of validity evidence:

• Research pertaining to the equivalency of scores across the two testing modalities (ALT and MAP).

• The MAP administration training materials, which provide instructions on the appropriate and inappropriate uses of the various NWEA assessments.

• This technical manual as a document that explains the ways in which NWEA promotes valid interpretation of test results.

• This technical manual as a document that describes the processes used to ensure that the content of the exams is valid.

• This technical manual as a document that describes the processes used to ensure that test scores are constructed in a manner that is valid.

• Concurrent validity statistics that correlate NWEA test results with other major national or state educational assessments.

Over the years, the testing field has presented substantial evidence supporting the hypothesis that tests administered under traditional paper-and-pencil and computerized adaptive testing modalities can be equivalent (Zara, 1992). Kingsbury and Houser (1988) and Kingsbury (2002) investigated MAP and ALT scores and found that they were equivalent in each study. The difference between the test-retest correlation of ALT-to-ALT tests and ALT-to-MAP tests was less than 0.05, and the largest observed mean difference was 1.5 RIT. These findings are important because they provide evidence that ALT and MAP may be considered equivalent testing forms.

A test score cannot simply be valid or invalid. Instead, a test score is valid or invalid for a particular use. ALT and MAP survey-with-goals test scores are valid for measuring the achievement level and growth of students. They are also validly used for course placement, parent conferences, district-wide testing, identifying the appropriate instructional level for students, and screening students for special programs.


The content of MAP and ALT assessments is valid for their intended uses. The items are written by classroom teachers. The manner in which the goals and objectives for each test are developed promotes a high degree of alignment between the curriculum and the test content. For ALT, teachers pick the items for their test, maximizing the relevancy of the test for their setting. For MAP, the item selection algorithms promote the selection of a group of items that most completely align with the goals and objectives of the test as well as the achievement level of the students. The statistical and experimental procedures used to develop the item bank, the measurement scales, and the assessment development process are a direct result of NWEA’s goal to create an assessment system capable of providing tests that are locally applicable. Taken together, the item development process, test content definition process, and test construction process provide strong evidence that the content of MAP and ALT assessments is valid for its intended use.

The NWEA scales, and the scores that stem from tests containing scaled items, were constructed in a manner that is widely accepted as valid. The Educational Assessment section outlined the paradigm of item response theory and explained how NWEA scale and test construction procedures are guided by this theory. The judicious use of this paradigm, coupled with additional research and experimental design, results in scales, tests, and scores with a variety of valid interpretations.

A primary element of evidence supporting the validity of the NWEA assessment scores for their intended uses is the series of concurrent validity studies that have compared scores from a variety of state assessments to the scores from MAP and ALT assessments. Table 5 displays the summary of the outcomes of these studies. MAP and ALT test scores consistently correlate highly with other measures of academic achievement for each state in which a study has been performed.

In looking closely at the trend in correlations across the grades, NWEA test scores tend to be more similar to other test scores in the upper grades than in the lower grades. This is most likely due to the increase in the reliability of scores obtained from students in higher grades. From the data provided in Table 5, it is clear that NWEA test scores are strongly related to other major educational test scores and that they are valid for similar uses.

Overall, there is substantial evidence supporting the validity of the NWEA assessments for measuring the achievement level and growth of students in the major subject areas. The content is valid. The comparability of the two test administration modalities is high. The scores are constructed in a valid manner. Test users are instructed about the appropriate uses of the test scores. Finally, the tests are correlated with other major tests, indicating that they are valid for similar uses. NWEA has a great deal of confidence in the validity of the assessments for their intended uses.


Table 5: Concurrent Validity Statistics for the NWEA Assessments (last updated 10/11/02)

All coefficients are correlations (r) between NWEA (ALT) scores and the listed assessment for the same students; N is the number of students. Where several values appear for a content area, they correspond to the different grade levels (grades 2 through 10) covered in the study.

Colorado State Assessment Program (CSAP) scale scores and ALT scores from the same students (2001, spring):
  Reading:      r = .86, .87, .87, .86, .86, .87, .87, .82   (N = 5,550; 7,840; 7,771; 7,724; 3,832; 3,885; 3,557; 4,759)
  Language:     r = .78, .84, .82, .82, .82, .83, .83, .82   (N = 5,633; 7,806; 7,916; 7,793; 3,799; 3,828; 3,509; 4,438)
  Mathematics:  r = .80, .85, .85, .87, .88, .87, .87        (N = 5,666; 7,878; 7,929; 7,794; 3,834; 3,841; 3,508)

Stanford Achievement Test, 9th Edition (SAT9) scale scores and ALT scores from the same students (2000, spring):
  Reading:      r = .84, .87, .86   (N = 3,488; 3,486; 6,337)
  Mathematics:  r = .91             (N = 5,023)

Iowa Tests of Basic Skills (Form K) and Meridian Checkpoint Level Tests (1999, fall):
  Reading:      r = .77, .84, .80   (N = 1,456; 1,473; 1,373)
  Language:     r = .77, .79, .79   (N = 1,441; 1,466; 1,397)
  Mathematics:  r = .74, .83, .84   (N = 1,425; 1,460; 1,365)

Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) and ALT scores from the same students (2000, fall):
  Reading (ALT) / Lang Arts (ISTEP):     r = .79, .84, .86   (N = 4,096; 4,296; 3,828)
  Lang Usage (ALT) / Lang Arts (ISTEP):  r = .78, .82, .84   (N = 4,096; 4,296; 3,828)
  Mathematics:                           r = .74, .86, .90   (N = 4,133; 4,299; 3,829)

Wyoming Comprehensive Assessment System and ALT scores from the same students (1998, spring):
  Reading:      r = .81, .80   (N = 2,286; 2,271)
  Mathematics:  r = .80, .85   (N = 2,203; 2,266)

Washington Assessment of Student Learning (grade 10, spring 2000) and ALT scores (grade 9, spring 1999) from the same students:
  Reading:      r = .75   (N = 1,003)
  Mathematics:  r = .81   (N = 849)

Washington Assessment of Student Learning and ALT scores from the same students (2000, spring):
  Reading:      r = .76, .79   (N = 1,452; 1,247)
  Lang Usage:   r = .60, .68   (N = 1,063; 1,002)
  Mathematics:  r = .79, .81   (N = 1,458; 1,552)


Reliability of Scores

In assessing the psychometric soundness of an assessment, one of the most widely used indicators is the reliability of the assessment. Reliability is an indicator of the consistency of test scores and is expressed in the same manner as a correlation coefficient. Possible values of most reliability coefficients range from 0.00 to 1.00. Values in excess of 0.70 are generally considered acceptable, and values above 0.90 are considered good.

The reliability of MAP and ALT scores has been calculated in two different manners. One method uses marginal reliability (Samejima, 1994), which may be applied to any tests constructed using IRT. Marginal reliability is one of the most appropriate methods of calculating reliability for adaptive tests. It uses the test information function to determine the expected correlation between the scores of two hypothetical tests taken by the same student. It also allows the calculation of reliability across multiple test forms, as in ALT and MAP.
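
One common way to express marginal reliability in IRT (stated here as a general result rather than the exact computation NWEA performed) is

$$\bar{\rho} \;=\; \frac{\sigma_\theta^2 - \overline{\sigma_e^2}}{\sigma_\theta^2} \;=\; 1 - \frac{\overline{\sigma_e^2}}{\sigma_\theta^2}, \qquad \sigma_e^2(\theta) = \frac{1}{I(\theta)},$$

where $\sigma_\theta^2$ is the variance of achievement in the group, $I(\theta)$ is the test information function, and $\overline{\sigma_e^2}$ is the error variance averaged over the group. As a purely illustrative example, a score standard deviation of about 15 RIT with a typical SEM of about 3 RIT gives $1 - 9/225 = 0.96$, in line with the values reported in Table 6.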

Marginal reliability is, therefore, a very appropriate measure of the overall psychometric soundness of the NWEA assessments. Table 6 displays the marginal reliability for the three major subjects in the NWEA assessment suite by grade level. All of the reliabilities are between 0.89 and 0.96. The reliability of the MAP assessments slightly exceeds that of ALT. This is not unexpected, given the continuously adaptive characteristics of the MAP system. The reliability of the MAP and ALT assessments is consistently high across all subjects and from grades two through ten.

The test-retest reliability of the ALT and MAP assessments has also been investigated in several large-scale studies. As the name suggests, test-retest reliability is the correlation between the scores of two different tests taken by the same student. NWEA districts test students multiple times throughout their educational career. The correlation between the pairs of scores of students from spring to fall, spring to spring, and fall to spring can therefore be calculated.

These correlations serve as a test-retest reliability indicator over a long period of time. This provides an indicator of score consistency throughout the grade spans covered by the assessments. Table 7 displays the test-retest reliability of the NWEA assessments. Values range between 0.79 and 0.94 for all test-retest pairs except for those that involve second graders. It is generally known that the assessment of second graders is inconsistent for many reasons, including the mixed reactions that these students have to taking multiple-choice test items. For these reasons, NWEA test users should expect less consistency in the test scores of second graders.


Table 6: Marginal Reliability Estimates for NWEA Assessments (last updated 10/11/02)
Source: NWEA Norms Study (the source for the means and standard deviations used to calculate the marginal reliabilities). Columns are grade levels 2 through 10.

MAP, 1999 Norms Study
                                        Gr 2    Gr 3    Gr 4    Gr 5    Gr 6    Gr 7    Gr 8    Gr 9    Gr 10
  Reading (Surv w/ Goals), Fall      r  .95     .95     .95     .94     .94     .94     .94     .94     .94
                                     N  4,662   39,590  39,960  40,671  35,508  36,318  34,121  7,620   1,639
  Reading (Surv w/ Goals), Spring    r  .95     .95     .94     .94     .94     .94     .94     .93     .94
                                     N  10,308  48,566  52,602  54,254  52,696  53,679  43,600  16,619  3,829
  Mathematics, Fall                  r  .92     .93     .94     .94     .94     .94     .95     .95     .95
                                     N  4,511   37,022  37,237  37,933  33,131  33,664  31,742  7,910   3,313
  Mathematics, Spring                r  .93     .94     .94     .94     .95     .94     .96     .96     .95
                                     N  9,863   47,635  52,580  53,753  52,581  53,631  43,093  16,725  5,583
  Lang Usage, Fall                   r  .94     .94     .94     .94     .94     .94     .94     .93     --
                                     N  4,292   20,769  21,593  21,980  20,035  19,869  18,630  3,553   --
  Lang Usage, Spring                 r  .94     .94     .94     .94     .94     .94     .93     .92     --
                                     N  4,758   19,676  23,167  25,304  23,389  24,290  21,038  5,914   --

ALT, 1996 Norms Study, Spring
  Reading                            r  --      .94     .94     .93     .93     .93     .94     .90     --
                                     N  --      24,623  25,447  27,512  29,664  26,500  24,676  5,045   --
  Mathematics                        r  --      .93     .94     .94     .94     .95     .95     .94     --
                                     N  --      27,190  28,628  30,109  32,147  28,244  27,380  5,261   --
  Lang Usage                         r  --      .93     .93     .91     .91     .91     .92     .89     --
                                     N  --      8,954   9,591   9,810   7,587   7,645   8,344   1,641   --

Table 7: Test-Retest Reliability Estimates for NWEA Assessments (last updated 10/11/02)
Source: NWEA Norms Study. Columns are grade levels 2 through 10.

ALT, 1999 Norms Study, Spring to Spring
                                        Gr 2    Gr 3    Gr 4    Gr 5    Gr 6    Gr 7    Gr 8    Gr 9    Gr 10
  Reading                            r  .76     .85     .88     .89     .89     .89     .89     .84     --
                                     N  4,253   27,460  30,091  34,525  30,079  28,386  26,190  9,231   --
  Mathematics                        r  .70     .79     .86     .89     .91     .93     .93     .87     .82
                                     N  4,177   26,522  30,100  34,073  29,730  28,077  24,432  8,788   1,598
  Lang Usage                         r  .77     .85     .89     .89     .90     .90     .90     .87     --
                                     N  3,795   14,173  17,285  19,037  16,825  16,822  15,991  3,514   --

ALT, 1999 Norms Study, Spring to Fall
  Reading                            r  .87     .88     .89     .89     .89     .87     .85     .84     --
                                     N  4,632   15,472  16,106  15,517  15,003  14,299  3,752   1,315   --
  Mathematics                        r  .79     .84     .87     .91     .91     .92     .89     .89     --
                                     N  4,585   15,456  16,682  15,302  14,739  13,540  3,864   1,612   --
  Lang Usage                         r  .89     .89     .90     .90     .90     .89     .88     --      --
                                     N  3,749   10,596  11,223  10,623  10,853  10,667  1,445   --      --

ALT, 1999 Norms Study, Fall to Spring
  Reading                            r  .81     .85     .89     .87     .88     .87     .84     .84     --
                                     N  6,326   22,908  22,294  24,085  26,813  23,756  6,709   2,576   --
  Mathematics                        r  .72     .82     .87     .89     .91     .91     .83     .85     --
                                     N  6,654   23,318  23,183  24,117  26,964  23,828  6,565   2,732   --
  Lang Usage                         r  .84     .86     .88     .89     .89     .89     .87     --      --
                                     N  3,749   10,488  11,035  10,386  11,151  10,101  1,588   --      --

ALT & MAP, 2002 Norms Study, Fall to Spring
  Reading                            r  .80     .87     .90     .91     .91     .91     .91     .90     .92
                                     N  5,470   48,033  53,797  55,451  52,257  52,804  46,925  14,798  3,121
  Mathematics                        r  .77     .84     .88     .91     .93     .94     .93     .90     .89
                                     N  5,963   49,806  54,971  56,500  54,325  53,730  46,425  8,971   1,410
  Lang Usage                         r  --      .88     .90     .91     .92     .92     .92     .91     .90
                                     N  --      35,994  38,970  38,747  36,826  38,350  33,513  11,393  2,590

ALT & MAP, 2002 Norms Study, Spring to Spring
  Reading                            r  .87     .89     .90     .91     .91     .90     .89     .86     .84
                                     N  18,512  50,241  50,782  52,507  54,207  44,580  10,684  2,621   1,790
  Mathematics                        r  .83     .87     .90     .91     .93     .93     .85     .79     --
                                     N  19,467  50,536  51,322  53,357  54,170  43,956  12,905  4,939   --
  Lang Usage                         r  .89     .89     .90     .91     .91     .92     .90     .89     .88
                                     N  11,197  29,555  31,587  31,317  31,321  28,875  8,500   2,438   1,508


Precision of Scores

Another indicator of exam performance is the precision of the assessment as measured by the SEM. Figure 12 displays the SEM of the three major NWEA subjects by RIT and by test modality. Notice that the SEM of most assessments is somewhere between 3 and 3.5 RIT points. The measurement error for scores at the far extremes of the score range tends to increase. For RIT scores as high as 260, the SEM is still less than 8 RIT points.

In evaluating the psychometric soundness of this precision, one must consider the concept of test efficiency as it was explained in the section Educational Assessments. Considering that the length of the NWEA assessments typically varies between 40 and 50 items, this level of precision is quite impressive. It is safe to say that the NWEA assessments are rather efficient tests.

Figure 12: The SEM of NWEA assessments by RIT score and test modality. Three panels (Spring 2001 data) plot the SEM in RIT points against the RIT score for ALT and MAP:

  • MAP and ALT Measurement Error for Mathematics by RIT, Spring 2001 (ALT n = 437,741; MAP n = 117,831)
  • MAP and ALT Measurement Error for Reading by RIT, Spring 2001 (ALT n = 436,643; MAP n = 155,609)
  • MAP and ALT Measurement Error for Language Usage by RIT, Spring 2001 (ALT n = 294,672; MAP n = 75,226)


It is useful to put meaning to the SEM values that are seen in the graphs in Figure 12. As with student scores, there are several useful ways to interpret the standard error seen in the MAP and ALT tests. Two of the most useful ways to look at measurement error (or precision) are the norm-referenced approach and the curriculum-referenced approach.

Norm-Referenced Precision

One approach to describing the precision of a test score is to compare it to the variability of achievement in an appropriate sample of students. In this case, an appropriate sample of students is the norming sample used in the 2002 study that established the RIT scale norms. This sample included slightly over 1,050,000 students taking approximately 3,040,000 tests in 321 school districts spread throughout 24 states.

In this norming study, the standard deviation of students’ spring mathematics scores ranged from 12.52 to 19.58, depending on the grade examined. The standard error for students’ scores in the same study averaged 3.1 RIT points. As a result, the ratio of the standard error of the test scores to the standard deviation of mathematics achievement in the sample was between 0.16 and 0.25.

This means most students are pinpointed within 0.32 to 0.50 standard deviations of the students in their grade, and a 95% confidence interval about a student’s mathematics score spans between 0.63 and 0.98 standard deviations. The results for reading, language usage, science concepts and processes, and general science are virtually identical to those in mathematics. This level of accuracy, seen in all content areas, is substantially higher than the level needed for most educational decisions.
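
These figures follow directly from the numbers quoted above:

$$\frac{3.1}{19.58} \approx 0.16, \qquad \frac{3.1}{12.52} \approx 0.25,$$

so a band of one SEM on either side of a score spans roughly $2 \times 0.16 = 0.32$ to $2 \times 0.25 = 0.50$ standard deviations, and a 95% confidence interval ($\pm 1.96$ SEM) spans roughly $3.92 \times 0.16 \approx 0.63$ to $3.92 \times 0.25 \approx 0.98$ standard deviations.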

Curriculum-Referenced Precision

Another way of describing the precision of the scores for MAP and ALT is to consider the difference in performance that is expected from students at the extremes of the confidence interval around a student’s score. If we identify a third grade student with a mathematics score of 185, the extremes of a 95% confidence interval would be 179 and 191.
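
Assuming the typical SEM of about 3.1 RIT points cited earlier, this interval follows from the usual normal approximation:

$$185 \pm 1.96 \times 3.1 \;\approx\; 185 \pm 6 \;\Rightarrow\; [179,\ 191].$$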

A student at the low end of this range would typically be challenged by multiplying a 3-digit number by a 1-digit number with regrouping. A student at the upper end of this range would be slightly less challenged by the multiplication problem above, but would be challenged by multiplying a 3-digit number by a 2-digit number with regrouping.

Clearly, the difference between the two extremes of the confidence interval defines a barely noticeable difference in the student’s capabilities. Once again, the accuracy of MAP and ALT scores is quite adequate for virtually any educational decision.


A Final Note on MAP and ALT Operating Characteristics

It is clear from the reliability, validity, and precision information that the MAP and ALT systems produce scores for students with the following characteristics:

• They are quite consistent, both within the test being given and across two tests given at different times.

• They are highly related to external measures of achievement.

• They have a level of precision that surpasses the level needed for educational decision making.

These operating characteristics support the application of MAP and ALT scores for a wide variety of educational uses in a variety of settings.

One element of the MAP and ALT systems that might not be clear is that they have the capacity to adjust to the purpose for which assessment is being done. The section Testing for a Purpose discussed the need to have specific types of tests for specific educational purposes. Both the MAP and ALT systems have the capacity to deliver tests that fit the purpose for which assessment is being done. One advantage of families of covalent assessments is that they can be changed without the need for new field testing and norming. A second advantage is that the characteristics of the altered test can be designed to meet a particular educational need before the test is actually given.

As an example, the precision of a MAP score is dependent on the length of the test given to the student. If there is a need for additional precision, the length of a MAP test can be adjusted to shrink the standard error of the scores. If there is a need for decisions about performance categories to be made for all students with a desired level of confidence, the number of items given to each student can be customized so that the level of confidence determines when the test is terminated. If there is a need to delve more deeply into content areas in which a student needs additional instruction, the content blueprint of the test can be designed to shift dynamically as the student takes the test.
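
The link between test length and precision can be illustrated with a small calculation: under an IRT model the standard error of a score is roughly one over the square root of the accumulated test information, so each additional well-targeted item shrinks the SEM in a predictable way. The per-item information value and the rough conversion of ten RIT per logit used below are illustrative assumptions, not MAP operating values.

```python
# Illustrative calculation of how the SEM shrinks as a test gets longer.
import math

PER_ITEM_INFORMATION = 0.22   # assumed information from a well-targeted item (logit metric)
RIT_PER_LOGIT = 10.0          # assumed rough conversion, used only for illustration


def sem_in_rit(test_length):
    """SEM is approximately 1 / sqrt(total test information)."""
    total_information = test_length * PER_ITEM_INFORMATION
    return (1.0 / math.sqrt(total_information)) * RIT_PER_LOGIT


for length in (20, 40, 50, 60):
    print(length, round(sem_in_rit(length), 1))
# Prints roughly 4.8, 3.4, 3.0, and 2.8 RIT: doubling the number of items cuts
# the SEM by about a factor of the square root of two.
```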

This ability to match the design of the assessment to the educational purpose is one of the strongest features of the system. It enables ALT and MAP to be used for a variety of purposes, from low-stakes classroom testing to high-stakes statewide testing. The system meets many current educational needs in ways that a conventional wide-range test cannot. As the needs of education change, MAP and ALT will also allow the assessments to change without distorting the longitudinal information available about our students.


Appendix

Initial Scale Development

The creation of the initial scales was conducted for each of the NWEA subject areas using the 1PL IRT model (Lord and Novick, 1968; Rasch, 1980). The development of the original scales was a multi-stage procedure. The first several stages concerned the utility of the 1PL model with the items to be used. A series of experiments concerning the reliability of student scores, the stability of the item difficulties, and the factor structure of the data set convinced the original researchers that the 1PL model was appropriate for use with the items in the pool (Ingebo, 1997). The original set of field trials to create the initial scales used a very conservative four-square linking design (Wright, 1977) that allowed NWEA to create and recursively compare multiple difficulty estimates for each item. This results in a very strong scale.
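
For reference, the 1PL (Rasch) model expresses the probability that a student with achievement $\theta$ answers an item of difficulty $b$ correctly as

$$P(X = 1 \mid \theta, b) \;=\; \frac{e^{\,\theta - b}}{1 + e^{\,\theta - b}},$$

so a student whose achievement matches an item's difficulty has a 50% chance of a correct response. Placing $\theta$ and $b$ in the same metric is what allows items and students to be located on a single scale.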

Item Bank Overview

Table A1 shows the number of items in the item banks for each subject area. In addition, this table shows the mean and range of item difficulties in each item bank. It should be noted that different items in mathematics have been calibrated with and without the use of a calculator. There is some overlap between these categories, since many items have been calibrated in one field test without calculators and in another field test with calculators. These items have two separate calibrations in the item bank, but are counted as one item in the table. This distinction in the mode of administration is quite important, since the use of calculators has been shown (Kingsbury and Houser, 1990) to reduce the difficulty of mathematics questions dramatically (approximately six RIT points, or slightly under one year of growth in mathematics for an average student). For this reason, mathematics items should never be used with calculators available unless they were calibrated with calculators available.

Table A1: Basic statistics for item banks used to develop level tests

                      Number        Item difficulty
  Item bank           of items    Minimum   Maximum   Average
  Language             4050        155       243       198.21
  Math                 8461        130       279       205.72
  Reading              5969        141       247       200.18
  Science concepts     1121        168       253       201.97
  General science       692        161       237       203.10

As can be seen in Table A1, the item banks used to create the level tests and MAP tests are quite large. Since an ALT series uses approximately 300 unique items and a MAP series uses approximately 1,200 items, a wide variety of ALT tests and MAP tests may be created from the banks. It is quite unlikely that any two districts will create the same tests, even if they use the same test design and blueprint.

The range of difficulty of the items in the item banks is quite broad. In the table it can be seen that the range of item difficulty in any one pool varies from 75 to 149 points on the RIT scale. To describe the breadth of difficulty in the item pools, it is helpful to compare them to achievement growth in students. In the Portland, Oregon Public Schools (a fairly large metropolitan school district), students grow from a mean reading achievement level of approximately 192 in the fall of the third grade to approximately 223 by the spring of the eighth grade. Similar growth patterns are seen in the other subject areas, and in every case the range of difficulty in the item banks is at least twice as great as the change in mean observed student achievement from the beginning of the third grade to the end of the eighth grade.

In each level test and MAP test, items are chosen from several different content areas according to a preset test blueprint. The major goal areas, the number of items in each major goal area, and the subgoals that are included in each item pool are shown in Table A2. Note that the total number of items in Table A2 does not match Table A1, because not all items are linked to any specific goal structure.

Table A2: Goal area coverage in each item bank

Language
  Writing Process             Number of items: 626    Difficulty range: 162-234    Mean difficulty: 198.66
  Composition Structure       Number of items: 688    Difficulty range: 162-234    Mean difficulty: 197.97
  Grammar/Usage               Number of items: 850    Difficulty range: 159-243    Mean difficulty: 195.77
  Punctuation                 Number of items: 651    Difficulty range: 161-235    Mean difficulty: 200.68
  Capitalization              Number of items: 577    Difficulty range: 155-235    Mean difficulty: 198.54

Mathematics
  Number/Numeration Systems   Number of items: 1011   Difficulty range: 158-279    Mean difficulty: 206.57
  Operations/Computation      Number of items: 1630   Difficulty range: 130-255    Mean difficulty: 195.43
  Equations/Numerals          Number of items: 420    Difficulty range: 168-255    Mean difficulty: 212.84
  Geometry                    Number of items: 462    Difficulty range: 157-264    Mean difficulty: 208.64
  Measurement                 Number of items: 608    Difficulty range: 148-269    Mean difficulty: 204.05
  Problem Solving             Number of items: 614    Difficulty range: 154-253    Mean difficulty: 208.02
  Statistics/Probability      Number of items: 524    Difficulty range: 153-264    Mean difficulty: 204.54
  Applications                Number of items: 804    Difficulty range: 158-269    Mean difficulty: 210.45

Reading
  Word Meaning                Number of items: 841    Difficulty range: 148-243    Mean difficulty: 195.49
  Literal Comprehension       Number of items: 1259   Difficulty range: 141-247    Mean difficulty: 199.03
  Interpretive Comprehension  Number of items: 1135   Difficulty range: 143-240    Mean difficulty: 201.21
  Evaluative Comprehension    Number of items: 776    Difficulty range: 157-240    Mean difficulty: 205.60

Science Concepts
  Concepts                    Number of items: 566    Difficulty range: 173-234    Mean difficulty: 203.51
  Processes                   Number of items: 528    Difficulty range: 168-229    Mean difficulty: 200.79

General Science
  Life Sciences               Number of items: 254    Difficulty range: 171-240    Mean difficulty: 203.71
  Earth/Space Sciences        Number of items: 203    Difficulty range: 168-230    Mean difficulty: 199.60
  Physical Sciences           Number of items: 235    Difficulty range: 166-280    Mean difficulty: 208.42


Acronym Glossary

ALT Achievement Level Tests

CRF Class Roster File

IEP Individualized Education Plan

IRT Item Response Theory

MAP Measures of Academic Progress

NWEA Northwest Evaluation Association

RIT Rasch unIT

RMSF Root-Mean-Square Fit

SEM Standard Error of Measurement

SMF Student Master Files

SPF Special Programs File

SRS Scoring and Reporting Software

TPS Test Printing Script


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Baker, F. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.

Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based on item parameters. Journal of Educational Measurement, 17, 283-296.

Drasgow, F., & Olson-Buchanan, J. B. (1999). Innovations in computerized assessment. Mahwah, NJ: Lawrence Erlbaum Associates.

Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum Associates.

Ingebo, G. S. (1997). Probability in the measure of achievement. Chicago, IL: MESA Press.

Kingsbury, G. G., & Houser, R. L. (1988). A comparison of achievement level estimates from computerized adaptive testing and paper-and-pencil testing. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Kingsbury, G. G., & Houser, R. L. (1990). The impact of calculator usage on the difficulty of mathematics questions. Unpublished manuscript.

Kingsbury, G. G. (2002, April). An empirical comparison of achievement level estimates from adaptive tests and paper-and-pencil tests. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Kingsbury, G. G. (2003, April). A long-term study of the stability of item parameter estimates. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Menlo Park, CA: Addison-Wesley.

Northwest Evaluation Association. (1996). Scoring and reporting system user's manual. Portland, OR: Northwest Evaluation Association.

Northwest Evaluation Association. (2002, August). RIT scale norms. Portland, OR: Northwest Evaluation Association.

Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed response, performance, and other formats. Boston, MA: Kluwer Academic Publishers.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: MESA Press.

Roid, G. H., & Haladyna, T. M. (1997). A technology for test-item writing. New York: Academic Press.

Samejima, F. (1994). Estimation of reliability coefficients using the test information function and its modifications. Applied Psychological Measurement, 18(3), 229-244.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory with tests of finite length. Psychometrika, 54, 427-450.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Zara, A. R. (1992, April). A comparison of computerized adaptive and paper-and-pencil versions of the national registered nurse licensure examination. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.