pisa 2000 technical report - oecd.org · pisa 2000 technical report edited byray adams and margaret...

PISA 2000 Technical Report

EDITED BY Ray Adams AND Margaret Wu

OECDORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT

ORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT

Pursuant to Article 1 of the Convention signed in Paris on 14th December 1960, and which came intoforce on 30th September 1961, the Organisation for Economic Co-operation and Development (OECD)shall promote policies designed:

– to achieve the highest sustainable economic growth and employment and a rising standard ofliving in Member countries, while maintaining financial stability, and thus to contribute to thedevelopment of the world economy;

– to contribute to sound economic expansion in Member as well as non-member countries in theprocess of economic development; and

– to contribute to the expansion of world trade on a multilateral, non-discriminatory basis inaccordance with international obligations.

The original Member countries of the OECD are Austria, Belgium, Canada, Denmark, France,Germany, Greece, Iceland, Ireland, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain,Sweden, Switzerland, Turkey, the United Kingdom and the United States. The following countriesbecame Members subsequently through accession at the dates indicated hereafter: Japan(28th April 1964), Finland (28th January 1969), Australia (7th June 1971), New Zealand (29th May 1973),Mexico (18th May 1994), the Czech Republic (21st December 1995), Hungary (7th May 1996), Poland(22nd November 1996), Korea (12th December 1996) and the Slovak Republic (14th December 2000). TheCommission of the European Communities takes part in the work of the OECD (Article 13 of the OECDConvention).

© OECD 2002Permission to reproduce a portion of this work for non-commercial purposes or classroom use should be obtainedthrough the Centre français d’exploitation du droit de copie (CFC), 20, rue des Grands-Augustins, 75006 Paris,France, tel. (33-1) 44 07 47 70, fax (33-1) 46 34 67 19, for every country except the United States. In the United Statespermission should be obtained through the Copyright Clearance Center, Customer Service, (508)750-8400,222 Rosewood Drive, Danvers, MA 01923 USA, or CCC Online: www.copyright.com. All other applications forpermission to reproduce or translate all or part of this book should be made to OECD Publications, 2, rue André-Pascal,75775 Paris Cedex 16, France.

histo_gen_a_20x27.fm Page 1 Friday, August 23, 2002 1:32 PM

FOREWORD

Are students well prepared to meet the challenges of the future? Are they able to analyse, reason andcommunicate their ideas effectively? Do they have the capacity to continue learning throughout life?Parents, students, the public and those who run education systems need to know the answers to thesequestions.

Many education systems monitor student learning in order to provide some answers to these questions.Comparative international analyses can extend and enrich the national picture by providing a largercontext within which to interpret national results. They can show countries their areas of relative strengthand weakness and help them to monitor progress and raise aspirations. They can also provide directionsfor national policy, for schools’ curriculum and instructional efforts and for students’ learning. Coupledwith appropriate incentives, they can motivate students to learn better, teachers to teach better, andschools to be more effective.

In response to the need for internationally comparable evidence on student performance, the OECDlaunched the Programme for International Student Assessment (PISA). PISA represents a new commitmentby the governments of OECD countries to monitor the outcomes of education systems in terms of studentachievement on a regular basis and within a common framework that is internationally agreed upon.PISA aims at providing a new basis for policy dialogue and for collaboration in defining and operationalisingeducational goals – in innovative ways that reflect judgements about the skills that are relevant to adultlife. It provides inputs for standard-setting and evaluation; insights into the factors that contribute to thedevelopment of competencies and into how these factors operate in different countries, and it should leadto a better understanding of the causes and consequences of observed skill shortages. By supporting ashift in policy focus from educational inputs to learning outcomes, PISA can assist countries in seeking tobring about improvements in schooling and better preparation for young people as they enter an adultlife of rapid change and deepening global interdependence.

PISA is a collaborative effort, bringing together scientific expertise from the participating countries,steered jointly by their governments on the basis of shared, policy-driven interests. Participating countriestake responsibility for the project at the policy level through a Board of Participating Countries. Expertsfrom participating countries serve on working groups that are charged with linking the PISA policyobjectives with the best available substantive and technical expertise in the field of international comparativeassessment of educational outcomes. Through participating in these expert groups, countries ensure thatthe PISA assessment instruments are internationally valid and take into account the cultural and curricularcontexts of OECD Member countries, that they provide a realistic basis for measurement, and that theyplace an emphasis on authenticity and educational validity. The frameworks and assessment instrumentsfor PISA 2000 are the product of a multi-year development process and were adopted by OECD Membercountries in December 1999.

First results from the PISA 2000 assessment were published in Knowledge and Skills for Life - FirstResults from PISA 2000 (2001). This publication presents evidence on the performance in reading,mathematical and scientific literacy of students, schools and countries, provides insights into the factorsthat influence the development of these skills at home and at school, and examines how these factorsinteract and what the implications are for policy development.

PISA is methodologically highly complex, requiring intensive collaboration among many stakeholders.The successful implementation of PISA depends on the use, and sometimes further development, of state-of-the-art methodologies. The PISA Technical Report describes those methodologies, along with otherfeatures that have enabled PISA to provide high quality data to support policy formation and review. Thedescriptions are provided at a level of detail that will enable review and potentially replication of theimplemented procedures and technical solutions to problems.

The PISA Technical Report is the product of a collaborative effort between the countries participatingin PISA and the experts and the institutions working within the framework of the International PISAConsortium that was contracted by the OECD with the implementation of PISA. The report was preparedby the International PISA Consortium, under the direction of Raymond Adams and Margaret Wu. Authorsand contributors to the individual chapters of this report are identified in the corresponding chapters.

For

ewor

d

3

TABLE OF CONTENTS

CHAPTER 1. THE PROGRAMME FOR INTERNATIONAL STUDENT ASSESSMENT: AN OVERVIEW.......15Managing and implementing PISA ......................................................................................................17This report ..........................................................................................................................................17

READER’S GUIDE.....................................................................................................................................................................19

SECTION ONE: INSTRUMENT DESIGN

CHAPTER 2. TEST DESIGN AND TEST DEVELOPMENT .................................................................21Test scope and test format...................................................................................................................21Timeline ..............................................................................................................................................21Assessment framework and test design................................................................................................22

Development of the assessment frameworks ...................................................................................22Test design ......................................................................................................................................22

Test development team........................................................................................................................23Test development process ....................................................................................................................23

Item submissions from participating countries ................................................................................23Item writing ....................................................................................................................................23The item review process ..................................................................................................................24Pre-pilots and translation input.......................................................................................................24Preparing marking guides and marker training material .................................................................25

PISA field trial and item selection for the main study .........................................................................25Item bank software .............................................................................................................................26PISA 2000 main study.........................................................................................................................26PISA 2000 test characteristics .............................................................................................................27

CHAPTER 3. STUDENT AND SCHOOL QUESTIONNAIRE DEVELOPMENT.................................33Overview.............................................................................................................................................33Development process...........................................................................................................................34

The questions ..................................................................................................................................34The framework ...............................................................................................................................34

Coverage .............................................................................................................................................36Student questionnaire......................................................................................................................36School questionnaire .......................................................................................................................36

Cross-national appropriateness of items..............................................................................................37Cross-curricular competencies and information technology ................................................................37

The cross-curricular competencies instrument.................................................................................37The information technology or computer familiarity instrument ....................................................38

SECTION TWO: OPERATIONS

CHAPTER 4. SAMPLE DESIGN .............................................................................................................39Target population and overview of the sampling design......................................................................39Population coverage, and school and student participation rate standards .........................................40

Coverage of the international PISA target population.....................................................................40Accuracy and precision ...................................................................................................................41School response rates ......................................................................................................................41Student response rates.....................................................................................................................42

Main study school sample...................................................................................................................42Definition of the national target population....................................................................................42The sampling frame ........................................................................................................................43

PIS

A 2

00

0 t

ech

nic

al r

epor

t

4

Stratification ...................................................................................................................................43Assigning a measure of size to each school .....................................................................................48School sample selection...................................................................................................................48Student samples...............................................................................................................................53

CHAPTER 5. TRANSLATION AND CULTURAL APPROPRIATENESS OF THE TEST AND SURVEYMATERIAL .............................................................................................................................................57

Introduction ........................................................................................................................................57Double translation from two source languages ...................................................................................57Development of source versions ..........................................................................................................59PISA translation guidelines..................................................................................................................60Translation training session.................................................................................................................61International verification of the national versions ...............................................................................61Translation and verification procedure outcomes................................................................................63

Linguistic characteristics of English and French source versions of the stimuli ...............................64Psychometric quality of the French and English versions ................................................................66Psychometric quality of the national versions .................................................................................67

Discussion ...........................................................................................................................................69

CHAPTER 6. FIELD OPERATIONS .......................................................................................................71Overview of roles and responsibilities .................................................................................................71

National Project Managers .............................................................................................................71School Co-ordinators ......................................................................................................................72Test administrators..........................................................................................................................72

Documentation ...................................................................................................................................72Materials preparation..........................................................................................................................73

Assembling test booklets and questionnaires...................................................................................73Printing test booklets and questionnaires ........................................................................................73Packaging and shipping materials ...................................................................................................74Receipt of materials at the National Centre after testing ................................................................75

Processing tests and questionnaires after testing..................................................................................75Steps in the marking process ...........................................................................................................75Logistics prior to marking...............................................................................................................77How marks were shown .................................................................................................................78Marker identification numbers........................................................................................................78Design for allocating booklets to markers.......................................................................................78Managing the actual marking .........................................................................................................81Multiple marking ............................................................................................................................82Cross-national marking...................................................................................................................84Questionnaire coding ......................................................................................................................84

Data entry, data checking and file submission.....................................................................................84Data entry .......................................................................................................................................84Data checking .................................................................................................................................84Data submission..............................................................................................................................84After data were submitted...............................................................................................................84

CHAPTER 7. QUALITY MONITORING ..............................................................................................85Preparation of quality monitoring procedures.....................................................................................85Implementing quality monitoring procedures......................................................................................86

Training National Centre Quality Monitors and School Quality Monitors ....................................86National Centre Quality Monitoring ..............................................................................................86School Quality Monitoring .............................................................................................................86Site visit data...................................................................................................................................87 T

able

of

con

ten

ts

5

SECTION THREE: DATA PROCESSING

CHAPTER 8. SURVEY WEIGHTING AND THE CALCULATION OF SAMPLING VARIANCE........89Survey weighting .................................................................................................................................89

The school base weight ...................................................................................................................90The school weight trimming factor .................................................................................................91The student base weight..................................................................................................................91School non-response adjustment .....................................................................................................91Grade non-response adjustment ......................................................................................................94Student non-response adjustment....................................................................................................94Trimming student weights...............................................................................................................95Subject-specific factors for mathematics and science .......................................................................96

Calculating sampling variance.............................................................................................................96The balanced repeated replication variance estimator .....................................................................96Reflecting weighting adjustments ....................................................................................................97Formation of variance strata ...........................................................................................................98Countries where all students were selected for PISA .......................................................................98

CHAPTER 9. SCALING PISA COGNITIVE DATA ................................................................................99The mixed coefficients multinomial logit model..................................................................................99The population model .......................................................................................................................100Combined model...............................................................................................................................101Application to PISA ..........................................................................................................................101National calibrations.........................................................................................................................101

Item response model file (infit mean square).................................................................................102Discrimination coefficients ............................................................................................................102Item-by-country interaction...........................................................................................................102National reports............................................................................................................................102

International calibration....................................................................................................................105Student score generation ...................................................................................................................105

Computing maximum likelihood estimates in PISA ......................................................................105Plausible values .............................................................................................................................105Constructing conditioning variables..............................................................................................107

Analysis of data with plausible values...............................................................................................107

CHAPTER 10. CODING AND MARKER RELIABILITY STUDIES ....................................................109Examining within-country variability in marking .............................................................................109

Homogeneity analysis ...................................................................................................................109Homogeneity analysis with restrictions .........................................................................................112An additional criterion for selecting items.....................................................................................114A comparison between field trial and main study .........................................................................116Variance component analysis ........................................................................................................116

Inter-country rater reliability study - Design .....................................................................................124Recruitment of international markers ...........................................................................................125Inter-country rater reliability study training sessions.....................................................................125Flag files........................................................................................................................................126Adjudication .................................................................................................................................126

CHAPTER 11. DATA CLEANING PROCEDURES ..............................................................................127Data cleaning at the National Centre................................................................................................127Data cleaning at the International Centre .........................................................................................128

Data cleaning organisation............................................................................................................128

PIS

A 2

00

0 t

ech

nic

al r

epor

t

6

Data cleaning procedures ..................................................................................................................128National adaptations to the database............................................................................................128Verifying the student tracking form and the list of schools ...........................................................128Verifying the reliability data..........................................................................................................129Verifying the context questionnaire data.......................................................................................129

Preparing files for analysis ................................................................................................................129Processed cognitive file..................................................................................................................130The student questionnaire .............................................................................................................130The school questionnaire ..............................................................................................................130The weighting files ........................................................................................................................130The reliability files ........................................................................................................................131

SECTION FOUR: QUALITY INDICATORS AND OUTCOMES

CHAPTER 12. SAMPLING OUTCOMES.............................................................................................133Design effect and effective sample size ..............................................................................................141

CHAPTER 13. SCALING OUTCOMES................................................................................................149International characteristics of the item pool ....................................................................................149

Test targeting ................................................................................................................................149Test reliability ...............................................................................................................................152Domain inter-correlations .............................................................................................................152Reading sub-scales ........................................................................................................................153

Scaling outcomes...............................................................................................................................153National item deletions .................................................................................................................153International scaling......................................................................................................................154Generating student scale scores .....................................................................................................154

Differential item functioning .............................................................................................................154Test length.........................................................................................................................................156Magnitude of booklet effects.............................................................................................................157

Correction of the booklet effect ....................................................................................................160

CHAPTER 14. OUTCOMES OF MARKER RELIABILITY STUDIES .................................................163Within-country reliability studies ......................................................................................................163

Variance components analysis .......................................................................................................163Generalisability coefficients...........................................................................................................163

Inter-country rater reliability study - Outcome..................................................................................174Overview of the main results ........................................................................................................174Main results by item .....................................................................................................................174Main study by country..................................................................................................................176

CHAPTER 15. DATA ADJUDICATION ...............................................................................................179Overview of response rate issues .......................................................................................................179Follow-up data analysis approach.....................................................................................................181Review of compliance with other standards......................................................................................182Detailed country comments...............................................................................................................182

Summary.......................................................................................................................................182Australia .......................................................................................................................................182Austria ..........................................................................................................................................183Belgium .........................................................................................................................................183Brazil.............................................................................................................................................184Canada..........................................................................................................................................184Czech Republic .............................................................................................................................184

Tab

le o

f co

nte

nts

7

Denmark .......................................................................................................................................184Finland..........................................................................................................................................185France ...........................................................................................................................................185Germany .......................................................................................................................................185Greece ...........................................................................................................................................185Hungary........................................................................................................................................185Iceland ..........................................................................................................................................185Ireland...........................................................................................................................................185Italy...............................................................................................................................................185Japan.............................................................................................................................................185Korea ............................................................................................................................................186Latvia............................................................................................................................................186Liechtenstein .................................................................................................................................186Luxembourg..................................................................................................................................186Mexico..........................................................................................................................................186Netherlands...................................................................................................................................186New Zealand ................................................................................................................................187Norway.........................................................................................................................................189Poland...........................................................................................................................................189Portugal ........................................................................................................................................189Russian Federation........................................................................................................................189Spain .............................................................................................................................................190Sweden..........................................................................................................................................190Switzerland ...................................................................................................................................190United Kingdom............................................................................................................................191United States .................................................................................................................................191

SECTION FIVE: SCALE CONSTRUCTION AND DATA PRODUCTS

CHAPTER 16. PROFICIENCY SCALES CONSTRUCTION ...............................................................195Introduction ......................................................................................................................................195Development of the described scales .................................................................................................195

Stage 1: Identifying possible sub-scales .........................................................................................195Stage 2: Assigning items to scales..................................................................................................196Stage 3: Skills audit .......................................................................................................................196Stage 4: Analysing field trial data..................................................................................................196Stage 5: Defining the dimensions ..................................................................................................196Stage 6: Revising and refining main study data.............................................................................196Stage 7: Validating ........................................................................................................................197

Defining proficiency levels ................................................................................................................197Reading literacy ................................................................................................................................199

Scope.............................................................................................................................................199Three reading proficiency scales....................................................................................................199Task variables ...............................................................................................................................200Task variables for the combined reading literacy scale..................................................................200Task variables for the retrieving information sub-scale .................................................................201Task variables for the interpreting texts sub-scale .........................................................................201Task variables for the reflection and evaluation sub-scale.............................................................202Retrieving information sub-scale...................................................................................................204Interpreting texts sub-scale............................................................................................................205Reflection and evaluation sub-scale...............................................................................................206Combined reading literacy scale....................................................................................................207

PIS

A 2

00

0 t

ech

nic

al r

epor

t

8

Cut-off points for reading .............................................................................................................207Mathematical literacy........................................................................................................................208

What is being assessed?.................................................................................................................208Illustrating the PISA mathematical literacy scale ...........................................................................210

Scientific literacy ...............................................................................................................................211What is being assessed?.................................................................................................................211Illustrating the PISA scientific literacy scale ..................................................................................213

CHAPTER 17. CONSTRUCTING AND VALIDATING THE QUESTIONNAIRE INDICES .............217Validation procedures of indices .......................................................................................................218Indices ...............................................................................................................................................219

Student characteristics and family background .............................................................................219Parental interest and family relations ............................................................................................221Family possessions ........................................................................................................................224Instruction and learning ................................................................................................................226School and classroom climate .......................................................................................................228Reading habits ..............................................................................................................................230Motivation and interest.................................................................................................................232Learning strategies ........................................................................................................................234Learning style................................................................................................................................237Self-concept...................................................................................................................................238Learning confidence ......................................................................................................................240Computer familiarity or information technology ..........................................................................242Basic school characteristics ...........................................................................................................244School policies and practices .........................................................................................................245School climate ...............................................................................................................................246School resources............................................................................................................................249

CHAPTER 18. INTERNATIONAL DATABASE...................................................................................253Files in the database ..........................................................................................................................253

The student files............................................................................................................................253The school file...............................................................................................................................254The assessment items data file.......................................................................................................254

Records in the database ....................................................................................................................254Records included in the database ..................................................................................................254Records excluded in the database .................................................................................................254

Representing missing data .................................................................................................................254How are students and schools identified? .........................................................................................255The student files ................................................................................................................................255The cognitive files .............................................................................................................................256The school file...................................................................................................................................256Further information ..........................................................................................................................257

REFERENCES ......................................................................................................................................259

APPENDIX 1. Summary of PISA reading literacy items........................................................................262

APPENDIX 2. Summary of PISA mathematical literacy items ..............................................................266

APPENDIX 3. Summary of PISA scientific literacy items......................................................................267

Tab

le o

f co

nte

nts

9

APPENDIX 4. The validity studies of occupation and socio-economic status (Summary) ....................268Canada..............................................................................................................................................268Czech Republic .................................................................................................................................268France ...............................................................................................................................................268United Kingdom................................................................................................................................269

APPENDIX 5. The main findings from quality monitoring of the PISA 2000 main study....................270School Quality Monitor Reports.......................................................................................................270Preparation for the assessment ..........................................................................................................272Beginning the testing sessions............................................................................................................273Conducting the testing sessions .........................................................................................................273The student questionnaire session .....................................................................................................274General questions concerning the assessment ....................................................................................274Summary...........................................................................................................................................275Interview with the PISA school co-ordinator.....................................................................................275National Centre Quality Monitor Reports ........................................................................................277Staffing..............................................................................................................................................277School Quality Monitors...................................................................................................................278Markers ............................................................................................................................................278Administrative Issues.........................................................................................................................278Translation and Verification..............................................................................................................279Adequacy of the manuals ..................................................................................................................280Conclusion ........................................................................................................................................281

APPENDIX 6. Order effects figures ......................................................................................................282

APPENDIX 7. PISA 2000 Expert Group membership and Consortium staff ........................................286PISA Consortium ..............................................................................................................................286

Australian Council for Educational Research................................................................................286Westat ...........................................................................................................................................286Citogroep ......................................................................................................................................286Educational Testing Service ...........................................................................................................286Other experts ................................................................................................................................286

Reading Functional Expert Group ....................................................................................................287Mathematics Functional Expert Group .............................................................................................287Science Functional Expert Group ......................................................................................................287Cultural Review Panel.......................................................................................................................287Technical Advisory Group.................................................................................................................287

APPENDIX 8. Contrast coding for PISA 2000 conditioning variables..................................................288

APPENDIX 9. Sampling forms .............................................................................................................305

APPENDIX 10. Student listing form.....................................................................................................316

APPENDIX 11. Student tracking form..................................................................................................318

APPENDIX 12. Adjustment to BRR for strata with odd numbers of primary sampling units ..............320

PIS

A 2

00

0 t

ech

nic

al r

epor

t

10

Tab

le o

f co

nte

nts

11

List of Figures

1. PISA 2000 in a Nutshell...................................................................................................................162. Test Booklet Design for the Main Survey.........................................................................................233. Mapping the Coverage of the PISA 2000 Questionnaires ................................................................354 School Response Rate Standards - Criteria for Acceptability ...........................................................425. Student Exclusion Criteria as Conveyed to National Project Managers ...........................................566. Allocation of the Booklets for Single Marking of Reading by Cluster..............................................797. Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Common Markers808. Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Separate Markers .809. Booklet Allocation to Markers for Multiple Marking of Reading....................................................8210. Booklet Allocation for Multiple Marking of Mathematics and Science, Common Markers.............8311. Booklet Allocation for Multiple Marking of Mathematics...............................................................8312. Example of Item Statistics Shown in Report 1 ...............................................................................10313. Example of Item Statistics Shown in Report 2 ...............................................................................10314. Example of Item Statistics Shown in Report 3 ...............................................................................10415. Example of Item Statistics Shown in Report 4 ...............................................................................10416. Example of Item Statistics Shown in Report 5 ...............................................................................10517. Notational System..........................................................................................................................11018. Quantification of Categories ..........................................................................................................11119. H* and H** for Science Items .......................................................................................................11420. Homogeneity and Departure from Uniformity...............................................................................11621. Comparison of Homogeneity Indices for Reading Items in the Main Study and the Field Trial in

Australia........................................................................................................................................11722. Comparison of Reading Item Difficulty and Achievement Distributions........................................15123. Comparison of Mathematics Item Difficulty and Achievement Distributions ................................15124. Comparison of Science Item Difficulty and Achievement Distributions .........................................15225. Items Deleted for Particular Countries ...........................................................................................15326. Comparison of Item Parameter Estimates for Males and Females in Reading................................15527. Comparison of Item Parameter Estimates for Males and Females in Mathematics ........................15528. Comparison of Item Parameter Estimates for Males and Females in Science .................................15529. Item Parameter Differences for the Items in Cluster One ...............................................................16130. Plot of Attained PISA School Response Rates ................................................................................18131. What it means to be at a Level ......................................................................................................19832. The Retrieving Information Sub-Scale............................................................................................20433. The Interpreting Texts Sub-Scale....................................................................................................20534. The Reflection and Evaluation Sub-Scale .......................................................................................20635. The Combined Reading Literacy Scale...........................................................................................20736. Combined Reading Literacy Scale Level Cut-Off Points ................................................................20737. The Mathematical Literacy Scale ...................................................................................................20938. Typical Mathematical Literacy Tasks .............................................................................................21139. The Scientific Literacy Scale...........................................................................................................21240. Characteristics of the Science Unit, Semmelweis ............................................................................21441. Characteristics of the Science Unit, Ozone.....................................................................................21542. Item Parameter Estimates for the Index of Cultural Communication (CULTCOM) ......................22143. Item Parameter Estimates for the Index of Social Communication (SOCCOM) ............................22244. Item Parameter Estimates for the Index of Family Educational Support (FAMEDSUP) .................22245. Item Parameter Estimates for the Index of Cultural Activities (CULTACTV).................................22246. Item Parameter Estimates for the Index of Family Wealth (WEALTH) ..........................................22447. Item Parameter Estimates for the Index of Home Educational Resources (HEDRES)....................22448. Item Parameter Estimates for the Index of Possessions Related to ‘Classical Culture’ in the Family

Home (CULTPOSS) .......................................................................................................................225

49. Item Parameter Estimates for the Index of Time Spent on Homework (HMWKTIME).................22650. Item Parameter Estimates for the Index of Teacher Support (TEACHSUP)....................................22651. Item Parameter Estimates for the Index of Achievement Press (ACHPRESS) .................................22752. Item Parameter Estimates for the Index of Disciplinary Climate (DISCLIM).................................22853. Item Parameter Estimates for the Index of Teacher-Student Relations (STUDREL) .......................22954. Item Parameter Estimates for the Index of Sense of Belonging (BELONG)....................................22955. Item Parameter Estimates for the Index of Engagement in Reading (JOYREAD) ..........................23156. Item Parameter Estimates for the Index of Reading Diversity (DIVREAD)....................................23157. Item Parameter Estimates for the Index of Instrumental Motivation (INSMOT)...........................23258. Item Parameter Estimates for the Index of Interest in Reading (INTREA).....................................23359. Item Parameter Estimates for the Index of Interest in Mathematics (INTMAT) ............................23360. Item Parameter Estimates for the Index of Control Strategies (CSTRAT) ......................................23561. Item Parameter Estimates for the Index of Memorisation (MEMOR) ...........................................23562. Item Parameter Estimates for the Index of Elaboration (ELAB).....................................................23563. Item Parameter Estimates for the Index of Effort and Perseverance (EFFPER) ..............................23664. Item Parameter Estimates for the Index of Co-operative Learning (COPLRN)..............................23765. Item Parameter Estimates for the Index of Competitive Learning (COMLRN)..............................23766. Item Parameter Estimates for the Index of Self-Concept in Reading (SCVERB).............................23967. Item Parameter Estimates for the Index of Self-Concept in Mathematics (MATCON) ..................23968. Item Parameter Estimates for the Index of Academic Self-Concept (SCACAD) .............................23969. Item Parameter Estimates for the Index of Perceived Self-Efficacy (SELFEF) .................................24170. Item Parameter Estimates for the Index of Control Expectation (CEXP).......................................24171. Item Parameter Estimates for the Index of Comfort With and Perceived Ability to Use Computers

(COMAB) .............................................................................................................................................24272. Item Parameter Estimates for the Index of Computer Usage (COMUSE) ......................................24373. Item Parameter Estimates for the Index of Interest in Computers (COMATT) ..............................24374. Item Parameter Estimates for the Index of School Autonomy (SCHAUTON) ...............................24575. Item Parameter Estimates for the Index of Teacher Autonomy (TCHPARTI) ................................24576. Item Parameter Estimates for the Index of Principals’ Perceptions of Teacher-Related Factors

Affecting School Climate (TEACBEHA) ........................................................................................24777. Item Parameter Estimates for the Index of Principals’ Perceptions of Student-Related Factors

Affecting School Climate (STUDBEHA) ........................................................................................24778. Item Parameter Estimates for the Index of Principals’ Perceptions of Teachers' Morale and

Commitment (TCMORALE) .........................................................................................................24779. Item Parameter Estimates for the Index of the Quality of the Schools’ Educational Resources

(SCMATEDU)................................................................................................................................24980. Item Parameter Estimates for the Index of the Quality of the Schools’ Physical Infrastructure

(SCMATBUI) .................................................................................................................................24981. Item Parameter Estimates for the Index of Teacher Shortage (TCSHORT)....................................25082. Duration of Questionnaire Session, Showing Histogram and Associated Summary Statistics ........27183. Duration of Total Student Working Time, with Histogram and Associated Summary Statistics ....27284. Item Parameter Differences for the Items in Cluster One ..............................................................28285. Item Parameter Differences for the Items in Cluster Two ..............................................................28386. Item Parameter Differences for the Items in Cluster Three ...........................................................28387. Item Parameter Differences for the Items in Cluster Four .............................................................28388. Item Parameter Differences for the Items in Cluster Five ..............................................................28489. Item Parameter Differences for the Items in Cluster Six ................................................................28490. Item Parameter Differences for the Items in Cluster Seven ............................................................28491. Item Parameter Differences for the Items in Cluster Eight ............................................................28592. Item Parameter Differences for the Items in Cluster Nine .............................................................285

PIS

A 2

00

0 t

ech

nic

al r

epor

t

12

List of Tables

1. Test Development Timeline..............................................................................................................222. Distribution of Reading Items by Text Structure and by Item Type.................................................273. Distribution of Reading Items by Type of Task (Process) and by Item Type ....................................274. Distribution of Reading Items by Text Type and by Item Type........................................................275. Distribution of Reading Items by Context and by Item Type ..........................................................286. Items from IALS ..............................................................................................................................287. Distribution of Mathematics Items by Overarching Concepts and by Item Type.............................288. Distribution of Mathematics Items by Competency Class and by Item Type ...................................299. Distribution of Mathematics Items by Mathematical Content Strands and by Item Type................2910. Distribution of Mathematics Items by Context and by Item Type ...................................................2911. Distribution of Science Items by Science Processes and by Item Type..............................................3012. Distribution of Science Items by Science Major Area and by Item Type ..........................................3013. Distribution of Science Items by Science Content Application and by Item Type ............................3014. Distribution of Science Items by Context and by Item Type ............................................................3115. Stratification Variables.....................................................................................................................4516. Schedule of School Sampling Activities............................................................................................5117. Percentage of Correct Answers in English and French-Speaking Countries or Communities for

Groups of Test Units with Small or Large Differences in Length of Stimuli in the Source Languages........................................................................................................................................65

18. Correlation of Linguistic Difficulty Indicators in 26 English and French Texts ...............................6619. Percentage of Flawed Items in the English and French National Versions .......................................6720. Percentage of Flawed Items by Translation Method ........................................................................6821. Differences Between Translation Methods .......................................................................................6922. Non-Response Classes .....................................................................................................................9223. School and Student Trimming .........................................................................................................9524. Some Examples of ∆ ......................................................................................................................11525. Contribution of Model Terms to (Co) Variance.............................................................................12026. Coefficients of the Three Variance Components (General Case)....................................................12227. Coefficients of the Three Variance Components (Complete Case) .................................................12328. Estimates of Variance Components (x100) in the United States ....................................................12329. Variance Components (%) for Mathematics in the Netherlands....................................................12430. Estimates of ρ3 for Mathematics Scale in the Netherlands.............................................................12431. Sampling and Coverage Rates .......................................................................................................13532. School Response Rates Before Replacements.................................................................................13833. School Response Rates After Replacements...................................................................................13934. Student Response Rates After Replacements .................................................................................14035. Design Effects and Effective Sample Sizes for the Mean Performance on the Combined Reading

Literacy Scale.................................................................................................................................14336. Design Effects and Effective Sample Sizes for the Mean Performance on the Mathematical Literacy Scale .......14437. Design Effects and Effective Sample Sizes for the Mean Performance on the Scientific Literacy Scale...................14538. Design Effects and Effective Sample Sizes for the Percentage of Students at Level 3 on the

Combined Reading Literacy Scale .................................................................................................14639. Design Effects and Effective Sample Sizes for the Percentage of Students at Level 5 on the

Combined Reading Literacy Scale .................................................................................................14740. Number of Sampled Students by Country and Booklet .................................................................15041. Reliability of the Three Domains Based Upon Unconditioned Unidimensional Scaling .................15242. Latent Correlation Between the Three Domains............................................................................15343. Correlation Between Scales............................................................................................................15344. Final Reliability of the PISA Scales ................................................................................................15445. Average Number of Not-Reached Items and Missing Items by Booklet and Testing Session .........156

Tab

le o

f co

nte

nts

13

46. Average Number of Not-Reached Items and Missing Items by Country and Testing Session ........15747. Distribution of Not-Reached Items by Booklet, First Testing Session ............................................15748. Distribution of Not-Reached Items by Booklet, Second Testing Session ........................................15849. Mathematics Means for Each Booklet and Country ......................................................................15850. Reading Means for Each Booklet and Country .............................................................................15951. Science Means for Each Booklet and Country ...............................................................................16052. Booklet Difficulty Parameters ........................................................................................................16153. Booklet Difficulty Parameters Reported on the PISA Scale ............................................................16154. Variance Components for Mathematics.........................................................................................16455. Variance Components for Science..................................................................................................16556. Variance Components for Retrieving Information .........................................................................16657. Variance Components for Interpreting Texts .................................................................................16758. Variance Components for Reflection and Evaluation ....................................................................16859. ρ3-Estimates for Mathematics .......................................................................................................16960. ρ3-Estimates for Science.................................................................................................................17061. ρ3-Estimates for Retrieving Information ........................................................................................17162. ρ3-Estimates for Interpreting Texts ................................................................................................17263. ρ3-Estimates for Reflection and Evaluation ...................................................................................17364. Summary of Inter-Country Rater Reliability Study........................................................................17465. Summary of Item Characteristics for Inter-Country Rater Reliability Study ..................................17566. Items with Less than 90 Per Cent Consistency ..............................................................................17667. Inter-Country Reliability Study Items with More than 25 Per Cent Inconsistency Rate in at Least

One Country .................................................................................................................................17768. Inter-Country Summary by Country..............................................................................................17869. The Potential Impact of Non-response on PISA Proficiency Estimates ..........................................18070. Performance in Reading by Grade in the Flemish Community of Belgium ....................................18371. Average of the School Percentage of Over-Aged Students ..............................................................18472. Results for the New Zealand Logistic Regression..........................................................................18873. Results for the United States Logistic Regression...........................................................................19374. Reliability of Parental Interest and Family Relations Scales ..........................................................22375. Reliability of Family Possessions Scales .........................................................................................22576. Reliability of Instruction and Learning Scales ...............................................................................22777. Reliability of School and Classroom Climate Scales ......................................................................23078. Reliability of Reading Habits Scales ..............................................................................................23279. Reliability of Motivation and Interest Scales .................................................................................23480. Reliability of Learning Strategies Scales.........................................................................................23681. Reliability of Learning Style Scales ................................................................................................23882. Reliability of Self-Concept Scales...................................................................................................24083. Reliability of Learning Confidence Scales......................................................................................24284. Reliability of Computer Familiarity or Information Technology Scales .........................................24485. Reliability of School Policies and Practices Scales .........................................................................24686. Reliability of School Climate Scales...............................................................................................24887. Reliability of School Resources Scales ...........................................................................................25088. Validity Study Results for France...................................................................................................26889. Validity Study Results for the United Kingdom .............................................................................26990. Responses to the School Co-ordinator Interview ...........................................................................27691. Number of Students per School Refusing to Participate in PISA Prior to Testing ..........................27792. Item Assignment to Reading Clusters ............................................................................................282

PIS

A 2

00

0 t

ech

nic

al r

epor

t

14

The Programme ForInternational StudentAssessment: An Overview

Ray Adams

The OECD Programme for International StudentAssessment (PISA) is a collaborative effortamong OECD Member countries to measurehow well 15-year-old young adults approachingthe end of compulsory schooling are prepared tomeet the challenges of today’s knowledgesocieties.1 The assessment is forward-looking:rather than focusing on the extent to which thesestudents have mastered a specific schoolcurriculum, it looks at their ability to use theirknowledge and skills to meet real-life challenges.This orientation reflects a change in curriculargoals and objectives, which are increasinglyconcerned with what students can do with whatthey learn at school.

The first PISA survey was conducted in 2000in 32 countries (including 28 OECD Membercountries) using written tasks answered inschools under independently supervised testconditions. Another 11 countries will completethe same assessment in 2002. PISA 2000surveyed reading, mathematical and scientificliteracy, with a primary focus on reading.Measures of attitudes to learning, andinformation on how students manage their ownlearning were also obtained in 25 countries aspart of an international option. The survey willbe repeated every three years, with the primaryfocus shifting to mathematics in 2003, science in2006 and back to reading in 2009.

In addition to the assessments, PISA 2000included Student and School Questionnaires tocollect data that could be used in constructingindicators pointing to social, cultural, economicand educational factors that are associated withstudent performance. Using the data taken fromthese two questionnaires, analyses linkingcontext information with student achievementcould address differences:• between countries in the relationships between

student-level factors (such as gender andsocial background) and achievement;

• in the relationships between school-levelfactors and achievement across countries;

• in the proportion of variation in achievementbetween (rather than within) schools, anddifferences in this value across countries;

• between countries in the extent to whichschools moderate or increase the effects ofindividual-level student factors and studentachievement;

• in education systems and national contextthat are related to differences in studentachievement across countries; and

• in the future, changes in any or all of theserelationships over time.Through the collection of such information at

the student and school level on a cross-nationally comparable basis, PISA addssignificantly to the knowledge base that waspreviously available from national officialstatistics, such as aggregate national statistics onthe educational programs completed and thequalifications obtained by individuals.

The ambitious goals of PISA come at a cost:PISA is both resource intensive and

1 In most OECD countries, compulsory schoolingends at age 15 or 16; in the United States it ends atage 17, and in Belgium, Germany and theNetherlands, at age 18 (OECD, 2001).

Chapter 1

methodologically highly complex, requiringintensive collaboration among manystakeholders. The successful implementation ofPISA depends on the use, and sometimes furtherdevelopment, of state-of-the-art methodologies.This report describes some of thosemethodologies, along with other features that

have enabled PISA to provide high quality datato support policy formation and review.

Figure 1 provides an overview of the centraldesign elements of PISA. The remainder of thisreport describes these design elements and theassociated procedures in more detail.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

16

Figure 1: PISA 2000 in a Nutshell

Sample size• More than a quarter of a million students, representing almost 17 million 15-year-olds enrolled in

the schools of the 32 participating countries, were assessed in 2000. Another 11 countries willadminister the same assessment in 2002.

Content• PISA 2000 covered three domains: reading literacy, mathematical literacy and scientific literacy.

• PISA 2000 looked at young people’s ability to use their knowledge and skills in order to meet real-life challenges rather than how well they had mastered a specific school curriculum.

• The emphasis was placed on the mastery of processes, the understanding of concepts, and the abilityto function in various situations within each domain.

• As part of an international option taken up in 25 countries, PISA 2000 collected information onstudents’ attitudes to learning.

Methods• PISA 2000 used pencil-and-paper assessments, lasting two hours for each student.

• PISA 2000 used both multiple-choice items and questions requiring students to construct their ownanswers. Items were typically organised in units based on a passage describing a real-life situation.

• A total of seven hours of assessment items was included, with different students taking differentcombinations of the assessment items.

• Students answered a background questionnaire that took about 30 minutes to complete and, as partof an international option, completed questionnaires on learning and study practices as well asfamiliarity with computers.

• School principals completed a questionnaire about their school.

Outcomes• A profile of knowledge and skills among 15-year-olds.

• Contextual indicators relating results to student and school characteristics.

• A knowledge base for policy analysis and research.

• Trend indicators showing how results change over time, once data become available from subsequentcycles of PISA.

Future assessments• PISA will continue in three-year cycles. In 2003, the focus will be on mathematics and in 2006 on

science. The assessment of cross-curricular competencies is being progressively integrated into PISA,beginning with an assessment of problem-solving skills in 2003.

Managing andImplementing PISA

The design and implementation of PISA 2000was the responsibility of an internationalConsortium led by the Australian Council forEducational Research (ACER). The otherpartners in this Consortium have been theNational Institute for Educational Measurement(Cito Group) in the Netherlands, the Service dePédagogie Expérimentale at Université de Liègein Belgium, Westat and the Educational TestingService (ETS) in the United States and theNational Institute for Educational Research(NIER) in Japan. Appendix 7 lists the manyConsortium staff and consultants who havemade important contributions to thedevelopment and implementation of the project.

The Consortium implements PISA within aframework established by a Board ofParticipating Countries (BPC) which includesrepresentation from all countries at senior policylevels. The BPC established policy priorities andstandards for developing indicators, forestablishing assessment instruments, and forreporting results. Experts from participatingcountries served on working groups linking theprogramme policy objectives with the bestinternationally available technical expertise inthe three assessment areas. These expert groupswere referred to as Functional Expert Groups(FEGs) (see Appendix 7 for members). Byparticipating in these expert groups andregularly reviewing outcomes of the groups’meetings, countries ensured that the instrumentswere internationally valid and that they tookinto account the cultural and educationalcontexts of the different OECD Membercountries, that the assessment materials hadstrong measurement potential, and that theinstruments emphasised authenticity andeducational validity.

Participating countries implemented PISAnationally through National Project Managers(NPMs), who respected common technical andadministrative procedures. These managersplayed a vital role in developing and validatingthe international assessment instruments andensured that PISA implementation was of highquality. The NPMs also contributed to theverification and evaluation of the survey results,analyses and reports.

The OECD Secretariat had overallresponsibility for managing the programme. Itmonitored its implementation on a day-to-daybasis, served as the secretariat for the BPC,fostered consensus building between thecountries involved, and served as the interlocutorbetween the BPC and the internationalConsortium.

This Report

This Technical Report does not report the resultsof PISA. The first results from PISA werepublished in December 2001 in Knowledge andSkills for Life (OECD, 2001) and a sequence ofthematic reports covering topics such as: SocialBackground and Student Achievement; TheDistribution and Impact of Strategies of Self-Regulated Learning and Self-Concept;Engagement and Motivation; and School FactorsRelated to Quality and Equity is planned forpublication in the coming months.

This Technical Report is designed to describethe technical aspects of the project at a sufficientlevel of detail to enable review and potentiallyreplication of the implemented procedures andtechnical solutions to problems. The report isbroken into five sections:• Section One—Instrument Design: Covers the

design and development of both thequestionnaires and achievement tests.

• Section Two—Operations: Covers theoperational procedures for the sampling andpopulation definitions, test administrationprocedures, quality monitoring and assuranceprocedures for test administration andnational centre operations, and instrumenttranslation.

• Section Three—Data Processing: Covers themethods used in data cleaning andpreparation, including the methods forweighting and variance estimation, scalingmethods, methods for examining inter-ratervariation and the data cleaning steps.

• Section Four—Quality Indicators andOutcomes: Covers the results of the scalingand weighting, reports response rates andrelated sampling outcomes and gives theoutcomes of the inter-rater reliability studies.The last chapter in this section summarises theoutcomes of the PISA 2000 data

An

ove

rvie

w

17

adjudication—that is, the overall analysis ofdata quality for each country.

• Section Five—Scale Construction and DataProducts: Describes the construction of thePISA 2000 described levels of proficiency andthe construction and validation ofquestionnaire-related indices. The finalchapter briefly describes the contents of thePISA 2000 database.

• Appendices: Detailed appendices of resultspertaining to the chapters of the report areprovided.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

18

Rea

der

’s g

uid

e

19

Reader’s Guide

List of abbreviations

The following abbreviations are used in this report:

ACER Australian Council for Educational ResearchAGFI Adjusted Goodness-of-Fit IndexBPC PISA Board of Participating CountriesBRR Balanced Repeated ReplicationCCC PISA 2000 questionnaire on self-regulated learningCFA Confirmatory Factor AnalysisCFI Comparative Fit IndexCIVED Civic Education StudyDIF Differential Item FunctioningENR Enrolment of 15-year-oldsETS Educational Testing ServiceFEG Functional Expert GroupI Sampling IntervalIALS International Adult Literacy SurveyICR Inter-Country Rater Reliability StudyIEA International Association for the Evaluation of Educational AchievementISCED International Standard Classification of EducationISCO International Standard Classification of OccupationsISEI International Socio-Economic IndexIT PISA questionnaire on computer familiarityMENR Enrolment for moderately small schoolsMOS Measure of sizeNCQM National Centre Quality MonitorNDP National Desired PopulationNEP National Enrolled PopulationNFI Normed Fit IndexNNFI Non-Normed Fit IndexNPM National Project ManagerPISA Programme for International Student AssessmentPPS Probability Proportional to SizePSU Primary Sampling UnitsRMSEA Root Mean Square Error of ApproximationRN Random NumberSC School Co-ordinatorSD Standard DeviationSEM Structural Equation ModellingSES Socio-Economic StatusSQM School Quality MonitorTA Test AdministratorTAG Technical Advisory GroupTCS Target Cluster SizeTIMSS Third International Mathematics and Science StudyTIMSS-R Third International Mathematics and Science Study - RepeatTOEFL Test of English as a Foreign LanguageVENR Enrolment for very small schoolsWLE Weighted Likelihood Estimates

Further Information

For further information on the PISA assessment instruments, the PISA database and methods used seealso:

Knowledge and Skills for Life – First Results from PISA 2000 (OECD, 2001);

Knowledge and Skills for Life – A New Framework for Assessment (OECD, 1999a);

Manual for the PISA 2000 Database (OECD, 2002a);

Sample Tasks from the PISA 2000 Assessment – Reading, Mathematical and Scientific Literacy (OECD, 2002b); and

The PISA Web site (www.pisa.oecd.org).

Test DesigN and Test Development

Margaret Wu

Test Scope and Test Format

PISA 2000 had three subject domains, withreading as the major domain, and mathematicsand science as minor domains. Student achievementin reading was assessed using 141 itemsrepresenting approximately 270 minutes oftesting time. The mathematics assessmentconsisted of 32 items, and the science assessmentconsisted of 35 items, representingapproximately 60 minutes of testing time foreach.

The materials used in the main study wereselected from a larger pool of approximately600 items tested in a field trial conducted in allcountries one year prior to the main study. Thepool included 16 items from the InternationalAdult Literacy Survey (IALS, OECD, 2000) andthree items from the Third InternationalMathematics and Science Study (TIMSS, Beatonet al., 1996). The IALS items were included toallow for possible linking of PISA results toIALS.1

PISA 2000 was a paper-and-pencil test, witheach student undertaking two hours of testing(i.e., answering one of the nine booklets). Thetest items were multiple-choice, short answer, andextended response. Multiple-choice items wereeither standard multiple-choice with a limitednumber (usually four or five) of responses fromwhich students were required to select the bestanswer, or complex multiple-choice presentingseveral statements from which students wererequired to give one of several possible responses(true/false, correct/incorrect, etc.). Closed

1 Note that after consideration by the Functional ExpertGroups (FEGs), National Project Managers (NPMs)and the Board of Participating Countries (BPC), a linkwith TIMSS was not pursued in the main study.

constructed-response items generally requiredstudents to construct a response within very limitedconstraints, such as mathematics items requiringa numeric answer, or items requiring a word orshort phrase, etc. Short response items were similarto closed constructed-response items, but had awide range of possible responses. Open constructed-response items required more extensive writing,or showing a calculation, and frequently includedsome explanation or justification.

Pencils, erasers, rulers, and, in some cases,calculators, were provided. The Consortiumrecommended that calculators be provided incountries where they were routinely used in theclassroom. National centres decided whethercalculators should be provided for their studentson the basis of standard national practice. Noitems in the pool required a calculator, but someitems involved solution steps for which the use ofa calculator could facilitate computation. Indeveloping the mathematics items, test developerswere particularly mindful to ensure that the itemswere as calculator-neutral as possible.

Timeline

The project started in January 1998, when someinitial conception of the frameworks wasdiscussed. The formal process of test develop-ment began after the first Functional ExpertGroups’ (FEGs) meetings in March 1998. Themain phase of the test item development finishedwhen the items were distributed for the field trialin November 1998. During this eight-monthperiod, intensive work was carried out in writingand reviewing items, and on pre-pilot activities.

section one: instrument design

Chapter 2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

22

The field trial for most countries took placebetween February and July 1999, after whichitems were selected for the main study anddistributed to countries in December 1999.

Table 1 shows the PISA 2000 testdevelopment timeline.

Assessment Frameworkand Test Design

Development of the AssessmentFrameworks

The Consortium, through the test developers andexpert groups, and in consultation with nationalcentres, developed assessment frameworks forreading, mathematics and science. The frameworkswere endorsed by the Board of ParticipatingCountries (BPC) and published by the OECD in1999 in Measuring Student Knowledge and Skills:A New Framework for Assessment (OECD, 1999a).

The frameworks presented the direction beingtaken by the PISA assessments. They definedeach assessment domain, described the scope ofthe assessment, the number of items required toassess each component of a domain and thepreferred balance of question types, andsketched the possibilities for reporting results.

Test Design

The 141 main study reading items were organisedinto nine separate clusters, each with an estimatedadministration time of 30 minutes. The 32 mathem-atics items and the 35 science items were organisedinto four 15-minute mathematics clusters and four15-minute science clusters respectively. These

clusters were then combinedin various groupings toproduce nine linked two-hour test booklets.

Using R1 to R9 to denotethe reading clusters, M1 to M4to denote the mathematicsclusters and S1 to S4 todenote the science clusters,the allocation of clusters tobooklets is illustrated inFigure 2.

The BPC requested thatthe majority of bookletsbegin with reading itemssince reading was the majorarea. In the design, seven ofthe nine booklets begin withreading. The BPC furtherrequested that students notbe expected to switchbetween assessment areas,and so none of the booklets

requires students to return to any area afterhaving left it.

Reading items occur in all nine booklets, andthere are linkages between the reading in allbooklets. This permits all sampled students to beassigned reading scores on common scales.Mathematics items occur in five of the ninebooklets, and there are links between the fivebooklets, allowing mathematics scores to bereported on a common scale for five-ninths ofthe sampled students. Similarly, science materialoccurs in five linked booklets, allowing sciencescores to be reported on a common scale forfive-ninths of the sampled students.

In addition to the nine two-hour booklets, aspecial one-hour booklet, referred to as Bookletzero (or the SE booklet) was prepared for use inschools catering exclusively to students with specialneeds. The SE booklet was shorter, and designed tobe somewhat easier than the other nine booklets.

The two-hour test booklets, sometimesreferred to in this report as ‘cognitive booklets’

Table 1: Test Development Timeline

Activity Date

Develop frameworks January-September 1998

Item submission from countries May-July 1998

Develop items March-November 1998

Pre-pilot in Australia October and November 1998

Distribution of field trial material November 1998

Translation into national languages November 1998-January 1999

Pre-pilot in the Netherlands February 1999

Field trial marker training February 1999

Field trial in participating countries February-July 1999

Select items for main study July-November 1999

Cultural Review Panel meeting October 1999

Distribute main study material December 1999

Main study marker training February 2000

Main study in participating countries February-October 2000

Inst

rum

ent

dei

sgn

23

to distinguish them from the questionnaires,were arranged in two one-hour parts, each madeup of two of the 30-minute time blocks from thecolumns in the above figure. PISA’s proceduresprovided for a short break to be taken betweenadministration of the two parts of the testbooklet, and a longer break to be taken betweenadministration of the test and the questionnaire.

Test Development Team

The core of the test development team consistedof staff from the test development sections of theAustralian Council for Educational Research(ACER) in Melbourne and the Cito Group in theNetherlands. Many others, including members ofthe FEGs, translators and item reviewers atnational centres, made substantial contributionsto the process of developing the tests.

Test Development Process

Following the development of the assessmentframework, the process of test developmentincluded: calling for submissions of test itemsfrom participating countries; writing andreviewing items; pre-piloting the test material;preparing marking guides and marker trainingmaterial; and selecting items for the main study.

Item Submissions fromParticipating Countries

An international comparative study should drawitems from a wide range of cultures andlanguages. Thus, at the start of the PISA 2000

Booklet Block 1 Block 2 Block 3 Block 41 R1 R2 R4 M1 M2

2 R2 R3 R5 S1 S2

3 R3 R4 R6 M3 M4

4 R4 R5 R7 S3 S4

5 R5 R6 R1 M2 M3

6 R6 R7 R2 S2 S3

7 R7 R1 R3 R8

8 M4 M2 S1 S3 R8 R9

9 S4 S2 M1 M3 R9 R8

Figure 2: Test Booklet Design for the Main Survey

test development process, the Consortium calledfor participating countries to submit items.

A document outlining submission guidelineswas prepared stating the purpose of the project,and the working definition of each subjectdomain as drafted by the FEGs. In particular, thedocument described PISA’s literacy orientationand its aim of tapping students’ preparedness forlife. The Consortium requested authenticmaterials in their original forms, preferably fromthe news media and original published texts. Thesubmission guidelines included a list ofvariables—such as, text types and formats,response formats and contexts (and situations)of the materials—according to which countrieswere requested to classify their submissions. Astatement on copyright status was also sought.Countries submitted items either directly to theConsortium, or through the FEG members whoadvised on their appropriateness before sendingthem on. Having FEG members act as liaisonserved the purpose of addressing translationissues, as many FEG members spoke languagesother than English, and could deal directly withitems written in languages other than English.

A total of 19 countries contributed materialsfor reading, 12 countries contributed materialsfor mathematics, and 13 countries contributedmaterials for science. For a full list of countryand language source for the main study items,see Appendix 1 (for reading), Appendix 2 (formathematics) and Appendix 3 (for science).

Item Writing

The Consortium test development team reviewedcontributed items from countries to ensure thatthey were consistent with the frameworks andhad no obvious technical flaws. The testdevelopers also wrote many items from scratch toensure that the pool satisfied the frameworkspecifications for the composition of theassessment instruments. There were three separateteams of test developers for the three domains,but some integrated units were developed thatincluded both reading and science items.2 In all,660 minutes of reading material, 180 minutes ofmathematics material, 180 minutes of science

2 An Integrated Unit contained a stimulus text thatusually had a scientific content. A number of readingitems and science items followed the stimulus.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

24

material, and 60 minutes of integrated materialwere developed for the field trial.3

To ensure that translation did not change theintentions of the questions or the difficulty of theitems, the test developers provided translationnotes where appropriate, to underline potentialdifficulties with translations and adaptations andto avoid possible misinterpretations of theEnglish or French words.4 Test developers alsoprovided a summary of the intentions of theitems. Suggestions for adaptations were alsomade to indicate the scope of possible changes inthe translated texts and items. During thetranslation process, the test developersadjudicated lists of country adaptations toensure that the changes did not alter what theitems were testing.

The Item Review Process

The field trial was preceded by an extensivereview process of the items including panellingby the test development team, national reviewsin each country, reviews by the FEGs, pre-pilotsin Australian and Dutch schools, and closescrutiny by the team of translators.

Each institution (ACER and Cito Group)routinely subjected all item development to aninternal item panelling process5 that involvedstaff from the PISA test development team andtest developers not specifically involved in thePISA project. ACER and Cito Group alsoexchanged items and reviewed and providedcomments on them.

3 However, for the main study, no integrated unitswere selected.4 The PISA instruments were distributed in matchingEnglish and French source versions.5 In test development literature, item panelling issometimes referred to as item shredding or a cognitivewalk-through.

After item panelling, the items were sent infour batches6 to participating countries fornational review. The Consortium requestedratings from national subject matter experts ofeach item and stimulus according to (i) students’exposure to the content of the item (ii) itemdifficulty (iii) cultural concerns (iv) other biasconcerns (v) translation problems, and (vi) anoverall priority rating for inclusion of the item.In all, 27 countries provided feedback; theratings were collated by item and used by thetest development team to revise and select itemsfor the field trial.

Before the field trial, the reading FEG met onfour occasions (March, May, August, October1998), and the mathematics and science groupsmet on three occasions (March, May, October1998) to shape and draft the frameworkdocuments and review items in detail. Inaddition to bringing their expertise in aparticular discipline, members of the expertgroups came from a wide range of languagegroups and countries, cultures and educationsystems. For a list of members of the expertgroups, see Appendix 7.

Pre-pilots and Translation Input

In the small-scale pre-pilot conducted inAustralian and Dutch schools, about 35 studentresponses on average were obtained for eachitem and used to check the clarity of thestimulus and the items, and to construct sampleanswers in the Marking Guides (see nextsection). The pre-pilot also provided someindications about the necessary answer time. Asa result of the pilots, it was estimated that, onaverage, an item required about two minutes tocomplete.

Translation also provided another indirect,albeit effective, review. Translations were madefrom English to French and vice versa to providethe national translation teams with two sourceversions of all materials (see Chapter 5) and theteam often pointed out useful information suchas typographical errors, ambiguities andtranslation difficulties, and some cultural issues.The group of experts reviewing the translatedmaterial detected some remaining errors in bothversions. Finally one French-speaking member

6 Referred to as item bundles.

Inst

rum

ent

dei

sgn

25

and one English-speaking member of the readingFEG reviewed the items for linguistic accuracy.

Preparing Marking Guides and MarkerTraining Material

A Marking Guide was prepared for each subjectarea with items that required markers to coderesponses. The guide emphasised that markerswere to code rather than score responses. Thatis, the guides separated different kinds ofpossible responses, which did not all necessarilyreceive different scores. The actual scoring wasdone after the field trial data were analysed,which provided information on the appropriatescores for each different response category.7

The Marking Guide was a list of responsecodes with descriptions and examples, but aseparate training workshop document was alsoproduced for each subject area. Marker trainingfor the field trial took place in February 1999.Markers used the workshop material as exercisesto practise using the Marking Guides. Workshopmaterial contained additional student responses:the correct response codes were in hidden texts,so that the material could be used as a test formarkers and as reference material. In addition, aMarker Recruitment Kit was produced withmore student responses to be used as a screeningtest when recruiting markers.

PISA Field Trial and ItemSelection for the MainStudy

The PISA Field Trial was carried out in mostcountries in the first half of 1999. An average ofabout 150 to 200 student responses to each itemwere collected in each country. During the fieldtrial, the Consortium set up a marker queryservice. Countries were encouraged to sendqueries to the service so that a commonadjudication process was consistently applied toall markers’ questions.

Between July and November 1999, nationalreviewers, the test development team, FEGs anda Cultural Review Panel reviewed and selecteditems.

• Each country received an item review formand the country’s field trial report (seeChapter 9). The NPM, in consultation withnational committees, completed the forms.

• Test developers met in September 1999 to(i) identify items that were totally unsuitablefor the main study, and (ii) recommend thebest set for the main study, ensuring a balancein all aspects specified in the frameworks.FEG chairs were also invited to join the testdevelopers.

• The FEGs met with the test developers inOctober 1999 and reviewed and modified therecommended set of selected items. A final setof items was presented to the National ProjectManagers Meeting in November 1999.

• A Cultural Review Panel met at the end ofOctober 1999 to ensure that the selected itemswere appropriate for all cultures and wereunbiased for any particular group of students.Cultural appropriateness was defined to meanthat assessment stimuli and items wereappropriate for 15-year-old students in allparticipating countries, that they were drawnfrom a cultural milieu that, overall, wasequally familiar to all students, and that thetests did not violate any national culturalvalues or positions. The Cultural ReviewPanel also considered issues of gender andappropriateness across different socio-economic groups. For the list of members inthe Cultural Review Panel, see Appendix 7.Several sources were used in the item review

process after the field trial.• Participant feedback was collected during the

translation process, at the marker trainingmeeting, and throughout each country’smarking activity for the field trial. TheConsortium had compiled information aboutthe wording of items, the clarity of theMarking Guides, translation difficulties andproblems with graphics and printing whichwere all considered when items were revised.Furthermore, countries were asked once againto rate and comment on field trial items for (i) curriculum relevance, (ii) interest to 15-year-olds, (iii) cultural, gender, or other bias, and(iv) difficulties with translation and marking.

• A report on problematic items was prepared.The report flagged items that were easier orharder than expected; had a positive non-key point-biserial or a negative key point-biserial;

7 It is worth mentioning here that as data entry wascarried out using KeyQuest®, many short-responseswere entered directly, which saved time and made itpossible to capture students’ raw responses.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

26

non-ordered ability for partial credit items; ora poor fit. While some of these problemscould be traced to the occasional translationor printing error, this report gave somestatistical information on whether there was asystematic problem with a particular item.

• A marker reliability study providedinformation on marker agreement for codingresponses to open-ended items and was usedto revise items and/or Marking Guides, or toreject items altogether.

• Differential Item Functioning (DIF) analyseswith regard to gender, reading competence andsocio-economic status (SES). Items exhibitinglarge DIF were candidates for removal.

• Item attributes. Test developers monitoredclosely the percentage of items in eachcategory for various attributes. For example,open-ended items were limited to thepercentages set by the BPC, and covereddifferent aspects of the area in the proportionsspecified in the framework.

Item Bank Software

To keep track of all the item information and tofacilitate item selection, the Consortiumprepared item bank software that managed allphases of the item development. The databasestored text types and formats, item formats,contexts, and all other framework classifications,and had a comments field for the changes madethroughout the test development. After the fieldtrial, the item bank stored statistical informationsuch as difficulty and discrimination indices. Theselection process was facilitated by a modulethat let test developers easily identify thecharacteristics of the selected item pool. Itemswere regularly audited to ensure that theinstruments that were developed satisfiedframework specifications.

PISA 2000 Main Study

After items for the PISA 2000 Main Study wereselected from the field trial pool, items andMarking Guides were revised where necessary,clusters and booklets were made, and all mainstudy test instruments were prepared for dispatchto national centres. In particular, more extensiveand detailed sample responses were added to theMarking Guides using material derived from thefield trial student booklets. The collection ofsample responses was designed to anticipate therange of possible responses to each item requiringexpert marking, to ensure consistency in codingstudent responses, and therefore to maximise thereliability of marking. These materials were sentto countries on December 31, 1999.

Marker training materials were preparedusing sample responses collected from field trialbooklets. Double-digit coding was developed todistinguish between the score and the responsecode and used for 10 mathematics and 12 scienceitems. The double-digit codes alloweddistinctions to be retained between responsesthat were reflective of quite different cognitiveprocesses and knowledge. For example, if analgebraic approach or a trial-and-error approachwas used to arrive at a correct answer, a studentcould score a ‘1’ for an item using eithermethod, and the method would be reflected inthe second digit. The double-digit codingcaptures different problem-solving approachesby using the first digit to indicate the score andthe second digit to indicate method or approach.

Marker training meetings took place inFebruary 2000. Members of the test develop-ment team conducted separate marker trainingsessions for each of the three test domains.

As in the field trial, a marker query servicewas established so that any queries from markingcentres in each country could be directed to therelevant Consortium test developer, andconsistent advice could be quickly provided.

Finally, when cleaning and analysis of datafrom the main study had progressed sufficientlyto permit item analysis on all items and thegeneration of item-by-country interaction data,the test developers provided advice on itemswith poor measurement properties, and wherethe interactions were potentially problematic.This advice was one of the inputs to thedecisions about item deletion at either theinternational or national level (see Chapter 13).

PISA 2000 TestCharacteristics

An important goal in building the PISA 2000tests was to ensure that the final form of thetests met the specifications as given in the PISA2000 framework (OECD 1999a). For reading,

Table 2: Distribution of Reading Items by Text Structure and by Item Type

Text Structure Number Multiple- Complex Closed Open Shortof Items Choice Multiple- Constructed- Constructed- Response

Choice Response Response

Continuous 89 (2) 42 3 3 34 (2) 7

Non-continuous 52 (7) 14 (2) 4 (1) 12 9 (1) 13 (3)

Total 141 (9) 56 (2) 7 (1) 15 43 (3) 20

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.

Table 3: Distribution of Reading Items by Type of Task (Process) and by Item Type

Type of Task Number Multiple- Complex Closed Open Short(Process) of Items Choice Multiple- Constructed- Constructed- Response


Interpreting 70 (5) 43 (2) 3 (1) 5 14 (2) 5

Reflecting 29 3 2 - 23 1

Retrieving Information 42 (4) 10 2 10 6 (1) 14 (3)

Total 141 (9) 56 (2) 7 (1) 15 43 (3) 20 (3)


this meant that an appropriate distributionacross: Text Structure, Type of Task (Process),Text Type, Context and Item Type was required.The attained distributions, which closelymatched the goals specified in the framework,are shown by framework category and itemtype8 in Tables 2, 3, 4 and 5.

Table 4: Distribution of Reading Items by Text Type and by Item Type

Text Type Number Multiple- Complex Closed Open Shortof Items Choice Multiple- Constructed- Constructed- Response


Advertisements 4 (3) - - - 1 3 (3)Argumentative/

Persuasive 18 (1) 7 1 2 8 (1) -Charts/Graphs 16 (1) 8 (1) - 2 3 3Descriptive 13 (1) 7 1 - 4 (1) 1Expository 31 17 1 - 9 4Forms 8 1 1 4 1 1 Injunctive 9 3 - 1 5 -Maps 4 1 - - 1 2Narrative 18 8 - - 8 2Schematics 5 2 2 - - 1Tables 15 (3) 2 (1) 1 (1) 6 3 (1) 3

Total 141 (9) 56 (2) 7 (1) 15 43 (3) 20 (3)


8 Item types are explained in section “Test Scope andTest Format”.

Inst

rum

ent

dei

sgn

27

Table 6: Items from IALS

Unit Name Unit and Item ID

Allergies Explorers R239Q01Allergies Explorers R239Q02Bicycle R238Q02Bicycle R238Q01Contact Employer R246Q01Contact Employer R246Q02Hiring Interview R237Q01Hiring Interview R237Q03Movie Reviews R245Q01Movie Reviews R245Q02New Rules R236Q02New Rules R236Q01Personnel R234Q02Personnel R234Q01Warranty Hot Point R241Q02Warranty Hot Point R241Q01

PIS

A 2

00

0 t

ech

nic

al r

epor

t

28

Table 5: Distribution of Reading Items by Context and by Item Type

Text Context Number Multiple- Complex Closed Open Shortof Items Choice Multiple- Constructed- Constructed- Response


Educational 39 (3) 22 4 1 4 8 (3)Occupational 22 4 1 4 9 4Personal 26 10 - 3 10 3Public 54 (6) 20 (2) 2 (1) 7 20 (3) 5

Total 141 (9) 56 (2) 7 (1) 15 43 (3) 20 (3)


For reading it was also determined that a linkto the IALS study should be explored. To makethis possible, 16 IALS items were included in thePISA 2000 item pool. The IALS link items arelisted in Table 6.

For mathematics an appropriate distributionacross: Overarching Concepts (mainmathematical theme), Competency Class,Mathematical Content Strands, Context andItem Type was required. The attaineddistributions, which closely matched the goalsspecified in the framework, are shown byframework category and item type in Tables 7, 8, 9 and 10.

Table 7: Distribution of Mathematics Items by Overarching Concepts and by Item Type

Overarching Number Multiple- Closed OpenConcepts of Items Choice Constructed- Constructed-

Response Response

Growth and Change 18 (1) 6 (1) 9 3Space and Shape 14 5 9 -

Total 32 (1) 11 (1) 18 3


Inst

rum

ent

dei

sgn

29

Table 8: Distribution of Mathematics Items by Competency Class and by Item Type

Competency Class Number Multiple- Closed Openof Items Choice Constructed- Constructed-

Response Response

Class 1 10 4 6 -

Class 2 20 (1) 7 (1) 11 2

Class 3 2 - 1 1

Total 32 (1) 11 (1) 18 3


Table 9: Distribution of Mathematics Items by Mathematical Content Strands and by Item Type

Mathematical Number Multiple- Closed OpenContent of Items Choice Constructed- Constructed-Strands Response Response

Algebra 5 - 4 1Functions 5 4 - 1Geometry 8 3 5 -Measurement 7 (1) 3 (1) 4 -Number 1 - 1 -Statistics 6 1 4 1

Total 32 (1) 11 (1) 18 3


Table 10: Distribution of Mathematics Items by Context and by Item Type

Context Number Multiple- Closed Openof Items Choice Constructed- Constructed-

Response Response

Community 4 - 2 2Educational 6 2 3 1Occupational 3 1 2 -Personal 12 (1) 6 (1) 6 -Scientific 7 2 5 -

Total 32 (1) 11 (1) 18 3


For science an appropriate distribution across:Processes, Major Area, Content Application,Context and Item Type was required. Theattained distributions, which closely matched thegoals specified in the framework, are shown byframework category and item type inTables 11, 12, 13 and 14.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

30

Table 11: Distribution of Science Items by Science Processes and by Item Type

Science Processes Number Multiple- Complex Closed Open Shortof Items Choice Multiple- Constructed- Constructed- Response


Communicating to Others 3 - - - 3 -Valid Conclusions from Evidence and Data

Demonstrating 15 9 1 - 3 2Understanding of Scientific Knowledge

Drawing and Evaluating 7 (1) 1 2 (1) 1 3 -Conclusions

Identifying Evidence and Data 5 2 1 - 2 -

Recognising Questions 5 1 3 - 1 -

Total 35 (1) 13 7 (1) 1 12 2


Table 12: Distribution of Science Items by Science Major Area and by Item Type

Science Major Number Multiple- Complex Closed Open ShortArea of Items Choice Multiple- Constructed- Constructed- Response


Earth and Environment 13 3 2 1 6 1Life and Health 13 6 1 - 5 1Technology 9 1) 4 4 (1) - 1 -

Total 35 (1) 13 7 (1) 1 12 2


Table 13: Distribution of Science Items by Science Content Application and by Item Type

Science Content Number Multiple- Complex Closed Open ShortApplication of Items Choice Multiple- Constructed- Constructed- Response


Atmospheric Change 5 - 1 1 3 -Biodiversity 1 1 - - - -Chemical and Physical Change 1 - - - 1 -Earth and Universe 5 3 1 - - 1Ecosystems 3 2 - - 1 -Energy Transfer 4 (1) - 2 (1) 2 -Form and Function 3 1 - - 2 -Genetic Control 2 1 1 - - -Geological Change 1 - - - 1 -Human Biology 3 1 - - 2 -Physiological Change 1 - - - - 1Structure of Matter 6 4 2 - - -

Total 35 (1) 13 7 (1) 1 12 2


Inst

rum

ent

dei

sgn

31

Table 14: Distribution of Science Items by Context and by Item Type

Context Number Multiple- Complex Closed Open Shortof Items Choice Multiple- Constructed- Constructed- Response


Global 16 4 3 1 7 1

Historical 4 2 - - 2 -

Personal 8 4 2 - 2 -

Public 7 (1) 3 2 (1) - 1 1

Total 35 (1) 13 7 (1) 1 12 2


Student and SchoolQuestionnaire Development

Adrian Harvey-Beavis

Overview

A Student and a School Questionnaire were usedin PISA 2000 to collect data that could be usedin constructing indicators pointing to social,cultural, economic and educational factors thatare thought to influence, or to be associatedwith, student achievement. PISA 2000 did notinclude a teacher questionnaire.

Using the data taken from these twoquestionnaires, analyses linking context andinformation with student achievement couldaddress differences:• between countries in the relationships between

student-level factors (such as gender andsocial background) and achievement;

• between countries in the relationships betweenschool-level factors and achievement;

• in the proportion of variation in achievementbetween (rather than within) schools, anddifferences in this value across countries;

• between countries in the extent to whichschools moderate or increase the effects ofindividual-level student factors and studentachievement;

• in education systems and national contextsthat are related to differences in studentachievement among countries; and

• in any or all of these relationships over time.In May 1997 the OECD Member countries

through the Board of Participating Countries (BPC)established a set of priorities and their relativeindividual importance to guide the development ofthe PISA context questionnaires. These werewritten into the project’s terms of reference. For theStudent Questionnaire, this included the following

priorities, in descending order:• basic demographics (date of birth and

gender);• global measures of socio-economic status

(SES);• student description of school/instructional

processes;• student attitudes towards reading and reading

habits;• student access to educational resources

outside school (e.g., books in the home, aplace to study);

• institutional patterns of participation andprogramme orientation (e.g., the student’seducational pathway and currenttrack/stream);

• language spoken in the home;• nationality (e.g., country of birth of student

and parents, time of immigration); and• student expectations (e.g., career and

educational plans).The Member countries considered that all of

these priorities needed to be measured in eachcycle of PISA. Two other measures in-depthmeasure of student’s socio-economic status andthe opportunity to learn were also consideredimportant and were selected for inclusion asfocus topics for PISA 2000.

The BPC requested that the StudentQuestionnaire be limited in length toapproximately 30 minutes.

Chapter 3

For the School Questionnaire, the BPCestablished the following set of priorities, indescending order of importance:• quality of the school’s human and material

resources (e.g., student-teacher ratio,availability of laboratories, quality ofteachers, teacher qualifications);

• global measures of school-level SES (studentcomposition);

• school-level variables on instructional context;• institutional structure/type;• urbanisation/community type;• school size;• parental involvement; and• public/private control and funding.

The BPC considered that these prioritiesneeded to be covered in each cycle of PISA. Fourother measures considered important andselected for inclusion as focus topics for the PISA2000 School Questionnaire were listed in orderof priority:• school policies and practices;• school socio-economic status and quality of

human and material resources (same rankingas staffing patterns);

• staffing patterns, teacher qualifications andprofessional development; and

• centralisation/decentralisation of decision-making and school autonomy.The BPC requested that the School

Questionnaire be limited in length toapproximately 20 minutes.

Questions were selected or developed for in-clusion in the PISA context questionnaires fromthe set of BPC priorities. The choice of questions,and the development of new ones, was guided bya questionnaire framework document, whichevolved over the period of the first cycle ofPISA.1 This document indicated what PISA couldor could not do because of its design limitations,and provided a way to understand key conceptsused in the questionnaires. As the projectevolved, the reporting needs came to play anincreasingly important role in defining andrefining BPC priorities. In particular, the needsof the initial report and the proposed thematicreports helped to clarify the objectives of thecontext questionnaires.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

34

Development Process

The Questions

The first cycle of the PISA data collection couldnot always benefit from previous PISA instrumen-tation to guide the development of the contextquestionnaires. The approach therefore was toconsider each BPC priority and then scaninstruments used in other studies, particularlythose that were international, for candidatequestions. Experts and Consortium membersalso developed questions. A pool of questionswas developed and evaluated by experts on thebasis of:• how well questions would measure the

dimensions they purported to tap;• breadth of coverage of the concept;• ease of translation;• anticipated ease of use for the respondents;• cultural appropriateness; and• face validity.

National centres were consulted extensively be-fore the field trial to ensure meeting a maximumnumber of criteria. Further review was undertakenafter the consultation and a set of questions wasdecided upon for the field trial and reviewed forhow well they covered BPC priorities. In order totrial as many questionnaire items as possible, threeversions of the Student Questionnaire were usedin the field trial, but only one version of the SchoolQuestionnaire. This was due partly to the smallnumber of field trial schools and to the fact thatthe domains to be covered were less challengingthan some in the Student Questionnaire.

While the question pool was being developed,the questionnaire framework was being revised.This was reviewed and developed on an ongoingbasis for providing guidance to the theoreticaland practical limits of the questionnaires.

To obtain further information about theresponse of students to the field trial version ofthe questionnaire, School Quality Monitors (seeChapter 7) interviewed students after they hadcompleted them. This feedback was incorporatedinto an extensive review process after the fieldtrial. For the main study, further review andconsultation was undertaken involving theanalysis of data captured during the field trialwhich gave an indication of:• the distribution of values, including levels of

missing data;

1 The questionnaire framework was not published bythe OECD but is available as a project workingdocument.

• how well variables hypothesised to beassociated with achievement correlated with,in particular, reading achievement;

• how well scales had functioned; and• whether there had been problems with specific

questions within individual countries.Experts and national centres were again

consulted in preparation for the main study.

The Framework

The theoretical work guiding the development ofthe context questionnaires focused on policy-related issues.

The grid shown in Figure 3 is drawn from thework of Travers and Westbury (1989) [see alsoRobitaille and Garden (1989)] and was used toassist in planning the coverage of the PISA 2000 questionnaires. The factors thatinfluence student learning in any given countryare considered to be contextually shaped.Context is the set of circumstances in which astudent learns and they have antecedents thatdefine them in fundamental ways. Whileantecedents emerge from historical processes anddevelopments, PISA views them primarily interms of existing institutional and social factors.

A context finds expression in various ways—ithas a content. Content was seen in terms ofcurriculum for the Third InternationalMathematics and Science Study (TIMSS).

Stu

den

t an

d s

choo

l q

ues

tion

aire

dev

elop

men

t

35

However, PISA is based on a dynamic model oflifelong learning in which new knowledge andskills necessary for successful adaptation to achanging world are continuously acquiredthroughout life. It aims at measuring how wellstudents are likely to perform beyond the schoolcurriculum and therefore conceives contentmuch more broadly than other internationalstudies.

The antecedents, context and content repre-sent one dimension of the framework depicted inFigure 3. Each of these finds expression at adifferent level—in the educational system, theschool, the classroom, and the student.

These four levels constitute the second dimensionof the grid. The bold text in the matrix representthe core elements that PISA could cover given itsdesign. The design permits data to be collectedfor seven of the 12 cells depicted in Figure 3 andmarked in bold. The PISA design best coversschool and student levels but does not permitcollecting much information about class level.PISA does not collect information on system leveldirectly either, but this can be derived from othersources or by aggregating some PISA variables atthe school or country level.

Questionnaires were developed through aprocess of review, evaluation, and consultationand further refined by examining field trial data,input from interviews with students, anddeveloping the questionnaire framework.

Antecedents Context Content

System 1. Country features 2. Institutional settings 3. Intended Schoolingand policies Outcomes

School 4. Community and school 5. School conditions and 6. Implemented curriculumcharacteristics processes

Class 7. Teacher background 8. Class conditions and 9. Implemented curriculumcharacteristics processes

Student 10. Student background 11. Student classroom 12. Attained Schoolingcharacteristics behaviours Outcomes

Figure 3: Mapping the Coverage of the PISA 2000 Questionnaires

Coverage

The PISA context questionnaires covered all BPCpriorities with the exception of Opportunity toLearn, which the BPC subsequently removed asa priority.

Student Questionnaire

Basic demographics

Basic demographics information collected included:date of birth; grade at school; gender; familystructure; number of siblings; and birth order.

Family background and measures of socio-economic status

The family background variables included:parental engagement in the workforce; parentaloccupation (which was later transformed intothe PISA International Socio-Economic Index ofOccupational Status, ISEI); parental education;country of birth of students and parents;language spoken at home; social and culturalcommunication with parents; family educationalsupport; activities related to classical culture;family wealth; home educational resources; useof school resources; and home possessionsrelated to classical culture.

Student description of school/instructionalprocesses

For the main study, data were collected on classsize; disciplinary climate; teacher-studentrelations; achievement press; teacher support;frequency of homework; and time spent onhomework. This group of variables also coveredinstruction time; sense of belonging; teachers’use of homework; and whether a studentattended a remedial class, and provided data onstudents’ marks in mathematics, science and thestudents’ main language class3.

Student attitudes towards reading andreading habits

Attitudes to reading and reading habits includedfrequency of borrowing books; reading diversity;time spent in reading; engagement in reading;and number of books in the home.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

36

Student access to educational resourcesoutside school

Students were asked about different types ofeducational resources available in the homeincluding books, computers and a place to study.

Institutional patterns of participation andprogramme orientation

The programme in which the student wasenrolled was identified within the StudentQuestionnaire and coded into the InternationalStandard Classification of Education (ISCED)(OECD, 1999b), making the data comparableacross countries.

Student career and educational expectationsStudents were asked about their career andeducational plans in terms of job expectationsand highest education level expected.

School Questionnaire

Basic school charactecristics

Administrators were asked about schoollocation; school size; hours of schooling; schooltype (public or private control and funding);institutional structure; and type of school(including grade levels, admission and transferpolicies; length of the school year in weeks andclass periods; and available programmes).

School policies and practicesThis area included the centralisation ordecentralisation of school and teacher autonomyand decision-making; policy practices concerningassessment methods and their use; and theinvolvement of parents.

School climateIssues addressed under this priority area includedperception of teacher and student-related factorsaffecting school climate; and perception ofteachers’ morale and commitment.

School resourcesThis area included the quality of the schools’educational resources and physical infrastructureincluding student-teaching staff ratio; shortageof teachers; and availability of computers.

3 Because of difficulties associated with operation-alising the question on ‘marks’, this topic was aninternational option.

Cross-NationalAppropriateness of Items

The cross-national appropriateness of items wasan important design criterion for thequestionnaires. Data collected within any onecountry had to allow valid comparisons withothers. National committees through NPMsreviewed questions to minimisemisunderstandings or likely misinterpretationsbased on different educational systems, culturesor other social factors.

To further facilitate cross-nationalappropriateness, PISA used typologies designedfor use in collecting international data; namely,the International Standard Classification ofEducation (ISCED), and the InternationalStandard Classification of Occupations (ISCO).Once occupations had been coded into ISCO, thecodes were re-coded into the International Socio-Economic Index of Occupational Status (ISEI),which provides a measure of the socio-economicstatus of occupations in all countries participatingin PISA (see Chapter 17).

Cross-curricularCompetencies andInformation Technology

In addition to these two context questionnairesdeveloped by the Consortium, there were twoother instruments, available as internationaloptions,4 that were developed by OECD expertsor consultants.

The Cross-curricular CompetenciesInstrument

Early in the development of PISA, theimportance of collecting information on avariety of cross-curricular skills, knowledge andattitudes considered to be central goals in mostschool systems became apparent. While theseskills, knowledge, and attitudes may not be partof the formal curriculum, they can be acquiredthrough multiple, indirect learning experiencesinside and outside of school. From 1993 to 1996,

Stu

den

t an

d s

choo

l q

ues

tion

aire

dev

elop

men

t

37

the Member countries conducted a feasibil-ity studyon four cross-curriculum domains: Knowledge ofPolitics, Economics and Society (Civics);Problem-Solving; Self-Perception/ Self-Concept;and Communication Skills.

Nine countries participated in the pilot: Austria,Belgium (Flemish and French Communities),Hungary, Italy, the Netherlands, Norway, Switz-erland and the United States. The study was co-ordinated by the Department of Sociology at theUniversity of Groningen, on behalf of the OECD.The results (published in the OECD brochure,Prepared for Life (Peschar, 1997)) indicated thatthe Problem-Solving and the CommunicationSkills instruments required considerable invest-ment in developmental work before they couldbe used in an international assessment. However,both the Civics and the Self-Perception instrumentsshowed promising psychometric statistics andencouraging stability across countries.

The INES General Assembly, the EducationCommittee and the Centre for EducationalResearch and Innovation (CERI) GoverningBoard members were interested by the feasibilitystudy and encouraged the Member countries toinclude a Cross-Curriculum Competency (CCC)component in the Strategic Plan that they weredeveloping. The component retained for the firstcycle of PISA was self-concept, since a study oncivic education was being conducted at the sametime by the IEA (International Association forthe Evaluation of Educational Achievement) andit was decided to avoid possible duplications inthis area. The CCC component was not includedin the PISA General Terms of Reference, and theOECD subcontracted a separate agency toconduct this part of the study.

From various aspects of self-concept that wereincluded in the pilot study, the first cycle of PISAinvolved the use of an instrument that provided aself-report on self-regulated learning, based on thefollowing sets of constructs:• strategies of self-regulated learning, which

regulate how deeply and systematicallyinformation will be processed;

• motivational preferences and goalorientations, which regulate the investment oftime and mental energy for learning purposesand influence the choice of learning strategies;

• self-regulated cognition mechanisms, whichregulate the standards, aims, and processes ofaction in relation to learning;

4 That is, countries could use these instruments, theConsortium would support their use, and the datacollected would be included in the international dataset.

• action control strategies, particularly effortand persistence, which prevent the learnerfrom being distracted by competing intentionsand help to overcome learning difficulties; and

• preferences for different types of learningsituations, learning styles and social skills thatare required for co-operative learning.A total of 52 items was used in PISA 2000 to

tap these five main dimensions.A rationale for the self-regulated learning

instrument and the items used is presented inDEELSA/PISA/BPC (98)25. This documentfocuses on the dynamic model of continuouslifelong learning that underlies PISA, which itdescribes as:

‘new knowledge and skills necessary forsuccessful adaptation to changingcircumstances are continuously acquiredover the life span. While students cannotlearn everything they will need to knowin adulthood they need to acquire theprerequisites for successful learning infuture life. These prerequisites are bothof cognitive and motivational nature.This depends on individuals having theability to organise and regulate their ownlearning, including being prepared tolearn independently and in groups, andthe ability to overcome difficulties in thelearning process. Moreover, furtherlearning and the acquisition of additionalknowledge will increasingly occur insituations in which people work togetherand are dependent on one another.Therefore, socio-cognitive and socialcompetencies will have a directlysupportive function.’

The Information Technology orComputer Familiarity Instrument

The PISA 2000 Information Technology (IT) orComputer Familiarity instrument consisted of 10 questions, some of which had several parts.The first six were taken from an instrumentdeveloped by the Educational Testing Service(ETS) in the United States. ETS developed a 23-item questionnaire in response to concerns thatthe Test of English as a Foreign Language(TOEFL), taken mostly by students from othercountries as part of applying to study in the

United States, may have been producing biasedresults when administered on a computer becauseof different levels of computer familiarity (seeEignor et al., 1998). Since the instrument provedto be short and efficient, it was suggested that itbe added to PISA. Network A and the BPCapproved the idea.

Of the 23 items in the ETS instrument, 14were used in the PISA 2000 questionnaire, someof which were grouped or slightly reworked. Forexample, the ETS instrument asks How wouldyou rate your ability to use a computer? and thePISA questionnaire asks, If you compare yourselfwith other 15-year-olds, how would you rateyour ability to use a computer?

The last four questions in the PISAquestionnaire (ITQ7 to ITQ10) were added toprovide a measure of students’ interest in andenjoyment of computers, which are not coveredin the ETS questionnaire.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

38

Sample Design

Sheila Krawchuk and Keith Rust

Target Population andOverview of the SamplingDesign

The desired base PISA target population in eachcountry consisted of 15-year-old studentsattending educational institutions located withinthe country. This meant that countries were toinclude (i) 15-year-olds enrolled full-time ineducational institutions, (ii) 15-year-olds enrolledin educational institutions who attended on onlya part-time basis, (iii) students in vocationaltraining types of programmes, or any otherrelated type of educational programmes, and (iv)students attending foreign schools within thecountry (as well as students from other countriesattending any of the programmes in the firstthree categories). It was recognised that notesting of persons schooled in the home,workplace or out of the country would occurand therefore these students were not included inthe International Target Population.

The operational definition of an age populationdirectly depends on the testing dates. Theinternational requirement was that the assessmenthad to be conducted during a 42-day period1

between 1 March 2000 and 31 October 2000.Further, testing was not permitted during the firstthree months of the school year because of aconcern that student performance levels may belower at the beginning of the academic year thanat the end of the previous academic year.

The 15-year-old International TargetPopulation was slightly adapted to better fit the

age structure of most of the NorthernHemisphere countries. As the majority of thetesting was planned to occur in April, theinternational target population was consequentlydefined as all students aged from 15 years and3 (completed) months to 16 years and2 (completed) months at the beginning of theassessment period. This meant that in allcountries testing in April 2000, the nationaltarget population could have been defined as allstudents born in 1984 who were attending aschool or other educational institution.

Further, a variation of up to one month in thisage definition was permitted. For instance, acountry testing in March or in May was stillallowed to define the national target populationas all students born in 1984. If the testing was totake place at another time, the birth datedefinition had to be adjusted and approved bythe Consortium.

The sampling design used for the PISAassessment was a two-stage stratified sample inmost countries. The first-stage sampling unitsconsisted of individual schools having 15-year-old students. In all but a few countries, schoolswere sampled systematically from a comprehensivenational list of all eligible schools withprobabilities proportional to a measure of size.2

The measure of size was a function of theestimated number of eligible 15-year-oldstudents enrolled. Prior to sampling, schools inthe sampling frame were assigned to strataformed either explicitly or implicitly.

1 Referred to as the testing window.2 Referred to as Probability Proportional to Size (orPPS) sampling.

section two: operations

Chapter 4

The second-stage sampling units in countriesusing the two-stage design were students withinsampled schools. Once schools were selected tobe in the sample, a list of each sampled school’s15-year-old students was prepared. From eachlist that contained more than 35 students,35 students were selected with equal probabilityand for lists of fewer than 35, all students on thelist were selected.

In three countries, a three-stage design wasused. In such cases, geographical areas weresampled first (called first-stage units) usingprobability proportional to size sampling, andthen schools (called second-stage units) wereselected within sampled areas. Students were thethird-stage sampling units in three-stage designs.

Population Coverage, andSchool and StudentParticipation RateStandards

To provide valid estimates of studentachievement, the sample of students had to beselected using established and professionallyrecognised principles of scientific sampling, in away that ensured representation of the full targetpopulation of 15-year-old students.

Furthermore, quality standards had to bemaintained with respect to (i) the coverage of theinternational target population, (ii) accuracy andprecision, and (iii) the school and studentresponse rates.

Coverage of the International PISATarget Population

In an international survey in education, the typesof exclusion must be defined internationally andthe exclusion rates have to be limited. Indeed, ifa significant proportion of students wereexcluded, this would mean that survey resultswould not be deemed representative of the entirenational school system. Thus, efforts were madeto ensure that exclusions, if they were necessary,were minimised.

Exclusion can take place at the school level(the whole school is excluded) or at the within-school level. In PISA, there are several reasonswhy a school or a student can be excluded.

Exclusions at school level might result from

removing a small, remote geographical regiondue to inaccessibility or size, or from removing alanguage group, possibly due to political,organisational or operational reasons. Areasdeemed by the Board of Participating Countriesto be part of a country (for the purpose ofPISA), but which were not included for samplingwere designated as non-covered areas, anddocumented as such—although, this occurredinfrequently. Care was taken in this regardbecause, when such situations did occur, thenational desired target population differed fromthe international desired target population.

International within-school exclusion rules forstudents were specified as follows:• Educable mentally retarded students are

students who are considered in theprofessional opinion of the school principal,or by other qualified staff members, to beeducable mentally retarded or who have beentested psychologically as such. This categoryincludes students who are emotionally ormentally unable to follow even the generalinstructions of the test. Students were not tobe excluded solely because of poor academicperformance or normal discipline problems.

• Functionally disabled students are studentswho are permanently physically disabled insuch a way that they cannot perform in thePISA testing situation. Functionally disabledstudents who could respond were to beincluded in the testing.

• Non-native language speakers only studentswho had received less than one year ofinstruction in the language(s) of the test wereto be excluded.A school attended only by students who would

be excluded for mental, functional or linguisticreasons was considered as a school exclusion.

It was required that the overall exclusion ratewithin a country be kept below 5 per cent.Restrictions on the level of exclusions of varioustypes were as follows:• School-level exclusions for inaccessibility, size,

feasibility or other reasons were required tocover fewer than 0.5 per cent of the totalnumber of students in the International PISATarget Population.

• School-level exclusions for educable mentallyretarded students, functionally retarded studentsor non-native language speakers were requiredto cover fewer than 2 per cent of students.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

40

• Within-school exclusions for educablementally retarded students, functionallyretarded students or non-native languagespeakers were required to cover fewer than2.5 per cent of students.

Accuracy and Precision

A minimum of 150 schools (or all schools ifthere were fewer than 150 schools in aparticipating jurisdiction) had to be selected ineach country. Within each participating school,35 students were randomly selected with equalprobability, or in schools with less than35 eligible students, all students were selected. Intotal, a minimum sample size of 4 500 assessedstudents was to be achieved. It was possible tovary the number of students selected per school,but if fewer than 35 students per school were tobe selected, then: (i) the sample size of schoolswas increased beyond 150, so as to ensure thatat least 4 500 students were sampled, and (ii) thenumber of students selected per school had to beat least 20, so as to ensure adequate accuracy inestimating variance components within andbetween schools—an analytical objective of PISA.

National Project Managers were stronglyencouraged to identify stratification variables toreduce the sampling variance.

School Response Rates

A response rate of 85 per cent was required forinitially selected schools. If the initial schoolresponse rate fell between 65 and 85 per cent, anacceptable school response rate could still beachieved through the use of replacement schools.To compensate for a sampled school that did notparticipate, where possible two replacementschools were identified for each sampled school.Furthermore, schools with a student participationrate between 25 and 50 per cent were notconsidered as a participating school for thepurposes of calculating and documenting responserates. However, data from such schools wereincluded in the database and contributed to theestimates included in the initial PISA internationalreport. Data from schools with a studentparticipation rate of less than 25 per cent werenot included in the database.

The rationale for this approach was asfollows. There was concern that, in an effort to

meet the requirements for school response rates,a national centre might accept participation fromschools that would not make a concerted effortto have students attend the assessment sessions.To avoid this, a standard for studentparticipation was required for each individualschool in order that the school be regarded as aparticipant. This standard was set at 50 per cent.However, there were a few schools in manycountries that conducted the assessment withoutmeeting that standard. Thus a post-hocjudgement was needed to decide if the data fromstudents in such schools should be used in theanalyses, given that the students had alreadybeen assessed. If the students from such schoolswere retained, non-response bias would beintroduced to the extent that the students whowere absent were different in achievement fromthose who attended the testing session, and sucha bias is magnified by the relative sizes of thesetwo groups. If one chose to delete all assessmentdata from such schools, then non-response biaswould be introduced to the extent that theschool was different from others in the sample,and sampling variance is increased because ofsample size attrition.

The judgement was made that, for a schoolwith between 25 and 50 per cent studentresponse, the latter source of bias and variancewas likely to introduce more error into the studyestimates than the former, but with the conversejudgement for those schools with a studentresponse rate below 25 per cent. Clearly the cut-off of 25 per cent is an arbitrary one, as onewould need extensive studies to try to establishthis cut-off empirically. However, it is clear that,as the student response rate decreases within aschool, the bias from using the assessed studentsin that school will increase, while the loss insample size from dropping all of the students inthe school will rapidly decrease.

Figure 4 provides a summary of theinternational requirements for school responserates.

These PISA standards applied to weightedschool response rates. The procedures forcalculating weighted response rates are presentedin Chapter 8. Weighted response rates weighteach school by the number of students in thepopulation that are represented by the studentssampled from within that school. The weightconsists primarily of the enrolment size of

Op

erat

ion

s

41

participating: the student response rate wascomputed using only students from schools withat least a 50 per cent response rate. Again,weighted student response rates were used forassessing this standard. Each student wasweighted by the reciprocal of his/her sampleselection probability. Weighted and unweightedrates differed little within any country.3

Main Study School Sample

Definition of the National TargetPopulation

National Project Managers (NPMs) were firstrequired to confirm their dates of testing and agedefinition with the PISA Consortium. Once thesewere approved, NPMs were alerted to avoidhaving the possible drift in the assessment periodlead to an unapproved definition of the nationaltarget population

Every NPM was required to define anddescribe his/her country’s national desired target

15-year-old students in the school, divided bythe selection probability of the school. Becausethe school samples were in general selected withprobability proportional to size, in mostcountries most schools contributed equalweights, so that weighted and unweighted schoolresponse rates were very similar. Exceptionscould occur in countries that had explicit stratathat were sampled at very different rates.However, in no case did a country differsubstantially with regard to the response ratestandard when unweighted rates were used,compared to weighted rates.

Details as to how the PISA participantsperformed relative to these school response ratestandards are included in the quality indicatorssection of Chapter 15.

Student Response Rates

A response rate of 80 per cent of selectedstudents in participating schools was required. Astudent who had participated in the first part ofthe testing session was considered to be a partic-ipant. A student response rate of 50 per cent wasrequired for a school to be regarded as

PIS

A 2

00

0 t

ech

nic

al r

epor

t

42

100%

95%

90%

85%

80%

75%

70%

65%

60%

60%0 65% 70% 75% 80% 85% 90% 95% 100%

OECD PISASchool Response Rates

Not Acceptable

Response Rate Before Replacements

Res

pons

e R

ate

Aft

er R

epla

cem

ents

Replacementsuccessful

Tried replacementbut without success

Figure 4: School ResponseRate Standards - Criteriafor Acceptability

3 After students from schools with below 50 per centstudent response were removed from the calculations.

population and explain how and why it mightdeviate from the international target population.Any hardships in accomplishing completecoverage were specified, discussed and approvedor not, in advance. Where the national desiredtarget population deviated from full nationalcoverage of all eligible students, the deviationswere described and enrolment data provided tomeasure how much coverage was reduced.

School-level and within-school exclusionsfrom the national desired target populationresulted in a national defined target populationcorresponding to the population of studentsrecorded on each country’s school samplingframe. Schools were usually excluded forpractical reasons such as increased survey costsor complexity in the sample design and/ordifficult test conditions. They could be excluded,depending on the percentage of 15-year-oldstudents involved, if they were geographicallyinaccessible (but not part of a region omittedfrom the national desired target population), orextremely small, or if it was not feasible toadminister the PISA assessment. Thesedifficulties were mainly addressed by modifyingthe sample design to reduce the number of suchschools selected rather than to exclude them,and exclusions from the national desired targetpopulation were held to a minimum and werealmost always below 0.5 per cent. Otherwise,countries were instructed to include the schoolsbut to administer the PISA SE booklet4,consisting of a subset of the easier PISAassessment items.

Within-school, or student-level, exclusionswere generally expected to be less than 2.5 percent in each country, allowing an overall level ofexclusions within a country to be no more than5 per cent. Because definitions of within-schoolexclusions could vary from country to country,however, NPMs were asked to adapt thefollowing rules to make them workable in theircountry but still to code them according to thePISA international coding scheme.

Within participating schools, all eligiblestudents (i.e., born within the defined timeperiod, regardless of grade) were to be listed.From this, either a sample of 35 students wasrandomly selected or all students were selected if

Op

erat

ion

s

43

there were fewer than 35 15-year-olds. The listshad to include sampled students deemed to meetone of the categories for exclusion, and avariable maintained to briefly describe thereason for exclusion. This made it possible toestimate the size of the within-school exclusionsfrom the sample data.

It was understood that the exact extent ofwithin-school exclusions would not be knownuntil the within-school sampling data werereturned from participating schools, andsampling weights computed. Country participantprojections for within-school exclusionsprovided before school sampling were known tobe estimates.

NPMs were made aware of the distinctionbetween within-school exclusions and non-response. Students who could not take theachievement tests because of a permanentcondition were to be excluded and those with atemporary impairment at the time of testing,such as a broken arm, were treated as non-respondents along with other absent sampledstudents.

Exclusions by country participant aredocumented in Chapter 12 in Section 4.

The Sampling Frame

All NPMs were required to construct a schoolsampling frame to correspond to their nationaldefined target population. This was defined bythe manual as a frame that would providecomplete coverage of the national defined targetpopulation without being contaminated byincorrect or duplicate entries or entries referringto elements that were not part of the definedtarget population. Initially, this list was toinclude any school with 15-year-old students,even those who might later be excluded. Thequality of the sampling frame directly effects thesurvey results through the schools’ probabilitiesof selection and therefore their weights and thefinal survey estimates. NPMs were thereforeadvised to be very careful in constructing theirframes, while realising that the frame dependslargely on the availability of appropriateinformation about schools and students.

All but three countries used school-levelsampling frames as their first stage of sampleselection. The Sampling Manual indicated thatthe quality of sampling frames for both two and

4 The SE booklet, also referred to as Booklet 0, wasdescribed in the section on test design in Chapter 2.

three-stage designs would largely depend on theaccuracy of the approximate enrolment of 15-year-olds available (ENR) for each first-stagesampling unit. A suitable ENR value was acritical component of the sampling frames sinceselection probabilities were based on it for bothtwo and three-stage designs. The best ENR forPISA would have been the number of currentlyenrolled 15-year-old students. Current enrolmentdata, however, were rarely available at the timeof sampling, which meant using alternatives.Most countries used the first option from thebest alternatives given as follows:• student enrolment in the target age category

(15-year-olds) from the most recent year ofdata available;

• if 15-year-olds tend to be enrolled in two ormore grades, and the proportions of studentswho are 15 in each grade are approximatelyknown, the 15-year-old enrolment can beestimated by applying these proportions to thecorresponding grade-level enrolments;

• the grade enrolment of the modal grade for15-year-olds; or

• total student enrolment, divided by thenumber of grades in the school.The Sampling Manual noted that if reasonable

estimates of ENR did not exist or if the availableenrolment data were too out of date, schoolsmight have to be selected with equalprobabilities. This situation did not occur forany country.

Besides ENR values, NPMs were instructedthat each school entry on the frame shouldinclude at minimum:• school identification information, such as a

unique numerical national identification, andcontact information such as name, addressand phone number; and

• coded information about the school, such asregion of country, school type and extent ofurbanisation, which could be used asstratification variables.5

As noted, three-stage designs and area-levelsampling frames were used by three countrieswhere a comprehensive national list of schoolswas not available and could not be constructedwithout undue burden, or where the procedures

PIS

A 2

00

0 t

ech

nic

al r

epor

t

44

for administering the test required that the schoolsbe selected in geographic clusters. As aconsequence, area-level sampling framesintroduced an additional stage of frame creationand sampling (called the first stage of sampling)before actually sampling schools (the second stageof sampling). Although generalities about three-stage sampling and using an area-level samplingframe were outlined in the Sampling Manual (forexample that there should be at least 80 first-stageunits and about half of them needed to besampled), NPMs were also instructed in theSampling Manual that the more detailedprocedures outlined there for the general two-stage design could easily be adapted to the three-stage design. NPMs using a three-stage designwere also asked to notify the Consortium andreceived additional support in using an area-levelsampling frame. The countries that used a three-stage design were Poland, the Russian Federationand the United States. Germany also used a three-stage design but only in two of its explicit data.

Stratification

Prior to sampling, schools were to be ordered, orstratified, in the sampling frame. Stratificationconsists of classifying schools into like groupsaccording to some variables—referred to asstratification variables. Stratification in PISA wasused for the following reasons:• to improve the efficiency of the sample design,

thereby making the survey estimates morereliable;

• to apply different sample designs, such asdisproportionate sample allocations, tospecific groups of schools, such as those instates, provinces, or other regions;

• to ensure that all parts of a population wereincluded in the sample; and

• to ensure adequate representation of specificgroups of the target population in the sample.There were two types of stratification possible:

explicit and implicit. Explicit stratification con-sists of building separate school lists, or samplingframes, according to the set of explicit stratificationvariables under consideration. Implicit stratificationconsists essentially of sorting the schools within eachexplicit stratum by a set of implicit stratificationvariables. This type of stratification is a very simpleway of ensuring a strictly proportional sampleallocation of schools across all implicit strata. It

5 Variables used for dividing the population intomutually exclusive groups so as to improve theprecision of sample-based estimates.

can also lead to improved reliability of surveyestimates, provided that the implicit stratificationvariables being considered are correlated withPISA achievement (at the school level). Guidelineswere provided on how to go about choosingstratification variables.

Table 15 provides the explicit stratificationvariables used by each country, as well as thenumber of explicit strata, and the variables and

their number of levels used for implicitstratification.6

6 As countries were requested to sort the samplingframe by school size, school size was also an implicitstratification variable, though it is not listed in Table 15. A variable used for stratification purposes is not necessarily included in the PISA data files.

Op

erat

ion

s

45

Table 15: Stratification Variables

Country Explicit Stratification Number of Implicit Stratification Variables Explicit Strata Variables

Australia State/Territory (8); Sector (3) 24 Metropolitan/Country (2)

Austria School Type (20) 19 School Identification Number(State, district, school)

Belgium (Fl.) School Type (5) 5 Within strata 1-3: Combinations ofSchool Track (8); Organised by: Private/Province/State/Local Authority/Other

Belgium (Fr.) School Size (3); 4 Proportion of Over-Age StudentsSpecial Education

Brazil School Grade (grades 5-10, 3 Public/Private (2); Region (5); 5-6 only, 7-10) and Urbanisation Score Range (5)

Canada Province (10); Language(3); 49 Public/Private (2); Urban/Rural (2)School Size (5)

Czech Republic School Type (6); School Size (4) 24 None

Denmark School Size (3) 3 School Type (4); County (15)

England School Size (2) 2 Very small schools: School type (2);Region (4) Other schools: SchoolType (3); Exam results: LEA schools(6), Grant Maintained schools (3)or Coeducational status inIndependent schools (3); Region (4)

Finland Region (6); Urban/Rural (2) 11 None

France Type of school for large 6 Noneschools (4); Very small schools; Moderately small schools

Germany Federal State (16); School Type (7) 73 For Special Education andVocational Schools: Federal State (16)

Greece Region (10); Public/Private (2) 20 School Type (3)

Hungary School Type (5) 2 Region (20)

Iceland Urban/Rural (2); School Size (4) 8 None

PIS

A 2

00

0 t

ech

nic

al r

epor

t

46

Table 15 (cont.)

Country Explicit Stratification Number of Implicit StratificationVariables Explicit Strata Variables

Ireland School Size (3) 3 School Type (3); School GenderComposition (3)

Italy School Type (2); School Size (3); 21 Area (5)School Programme (4)

Latvia School Size (3) 3 Urbanisation (3); School Type (5)

Korea School Type and Urbanisation 10 Administrative Units (16)

Japan Public/Private (2); 4 NoneSchool Programme (2)

Liechtenstein Public/Private (2) 2 None

Luxembourg None 1 None

Mexico School Size (3) 3 School Type for Lower Secondary(4); School Type for UpperSecondary (3)

Netherlands None 1 School Track (5)

New Zealand School Size (3) 3 Public/Private (2); School GenderComposition (3); Socio-EconomicStatus (3); Urban/Rural (2)

Northern Ireland None 1 School Type (3); Exam Results:Secondary (4), Grammar (2);Region (5)

Norway School Size/School Type (5) 5 None

Poland Type of School (4) 4 Type of Gmina (Region) (4)

Portugal School Size (2) 2 Public/Private (2); Region (7);Index of Social Development (4)

Russian Federation Region (45); Small Schools (2) 47 School Programme (2); Urbanisation(6)

Scotland Public/Private (2) 2 For Public Schools: Average SchoolAchievement (5)

Spain Community (17); Public/Private (2) 34 Geographic unit in somecommunities; Population size inothers

Sweden Public (1); Private (3); Community 10 Private schools: Community Group for Public Schools (5); Group (9); Public schools: Income Secondary Schools (1) Quartile (4); Secondary Schools:

Geographic Area (22)

Switzerland Area and Language (11); Public/ 27 National Track (9); Canton (26); Private (2); School Programme (4); Cantonal TrackHas grade 9 or not (2); School Size (3)

United States None 1 Public/Private (2); High/LowMinority (2); Private School Type(3); Primary Sampling Unit (52)

Note: The absence of brackets indicates that numerous levels existed for that stratification variable.

Treatment of small schools in stratificationIn PISA, small, moderately small and very smallschools were identified, and all others wereconsidered large. A small school had anapproximate enrolment of 15-year-olds (ENR)below the target cluster size (TCS = 35 in mostcountries) of numbers of students to be sampledfrom schools with large enrolments. A very smallschool had an ENR less than one-half the TCS—17 or less in most countries. A moderatelysmall school had an ENR in the range of to TCS. Unless they received special treatment,small schools in the sample could reduce thesample size of students for the national sampleto below the desired target because the in-schoolsample size would fall short of expectations. Asample with many small schools could also be anadministrative burden. To minimise theseproblems, procedures for stratifying andallocating school samples were devised for smallschools in the sampling frame.

To determine what was needed—a singlestratum of small schools (very small andmoderately small combined), a stratum of verysmall schools only, or two strata, one of verysmall schools and one of moderately smallschools—the Sampling Manual stipulated that ifthe percentage of students in small schools(those schools for which ) was:• less than 5 per cent and the percentage in very

small schools (those for which ( )was less than 1 per cent, no small schoolstrata were needed.

• less than 5 per cent but the percentage in verysmall schools was 1 per cent or more, astratum for very small schools was needed.

• 5 per cent or more, but the percentage in verysmall schools was less than 1 per cent, astratum for small schools was needed, but nospecial stratum for very small schools wasrequired.

• 5 per cent or more, and the percentage in verysmall schools was 1 per cent or more, astratum for very small schools was needed,and a stratum for moderately small schoolswas also needed.The small school strata could be further

divided into additional explicit strata on thebasis of other characteristics (e.g., region).Implicit stratification could also be used.

When small schools were explicitly stratified,it was important to ensure that an adequate

sample was selected without selecting too manysmall schools as this would lead to too fewstudents in the assessment. In this case, the entireschool sample would have to be increased tomeet the target student sample size.

The sample had to be proportional to thenumber of students and not to the number ofschools. Suppose that 10 per cent of studentsattend moderately small schools, 10 per centvery small schools and the remaining 80 per centattend large schools. In the sample of 5 250, 4 200 students would be expected to come fromlarge schools (i.e., 120 schools by 35 students),525 students from moderately small schools and525 students from very small schools. Ifmoderately small schools had an average of25 students, then it would be necessary to include21 moderately small schools in the sample. If theaverage size of very small schools was 10 students,then 52 very small schools would be needed inthe sample and the school sample size would beequal to 193 schools rather than 150.

To balance the two objectives of selecting anadequate sample of explicitly stratified smallschools, a procedure was recommended thatassumes identifying strata of both very small andmoderately small schools. The underlying idea isto under-sample by a factor of two the verysmall school stratum and to increaseproportionally the size of the large school strata.When there was just a single small schoolstratum, the procedure was modified by ignoringthe parts concerning very small schools. Theformulae below also assume a school sample sizeof 150 and a student sample size of 5 250.

- Step 1: From the complete sampling frame,find the proportions of total ENR that comefrom very small schools (P), moderately smallschools (Q), and larger schools (those withENR of at least TCS) (R). Thus .

- Step 2: Calculate the figure L, where. Thus L is a positive number

slightly less than 1.0.

- Step 3: The minimum sample size for largerschools is equal to , rounded tothe nearest integer. It may need to be enlargedbecause of national considerations, such asthe need to achieve minimum sample sizes forgeographic regions or certain school types.

150 × ( )R L

L P= −( )1 2

P Q R+ + =1

ENR TCS< 2

ENR TCS<

TCS 2

Op

erat

ion

s

47

- Step 4: Calculate the mean value of ENR formoderately small schools (MENR), and forvery small schools (VENR). MENR is anumber in the range of TCS/2 to TCS, andVENR is a number no greater than TCS/2.

- Step 5: The number of schools that must besampled from the stratum of moderately smallschools is given by: (5 250 x Q)/(L x MENR)

- Step 6: The number of schools that must besampled from the stratum of very smallschools is given by: (2 625 x P)/(L x VENR).

To illustrate the steps, suppose that inparticipant country X, the TCS is equal to 35,with 0.1 of the total enrolment of 15-year-oldsin moderately small schools and in very smallschools. Suppose that the average enrolment inmoderately small schools is 25 students, and invery small schools it is 10 students. Thus P = 0.1,Q = 0.1, R = 0.8, MENR = 25 and VENR = 10.

From Step 2, L = 0.95, then (Step 3) thesample size of larger schools must be at least150 x (0.80/0.95) = 126.3. That is, at least 126 ofthe larger schools must be sampled. From Step 5,the number of moderately small schools requiredis (5 250 x 0.1)/(0.95 x 25) = 22.1 —i.e., 22 schools.From Step 6, the number of very small schoolsrequired is (2 625 x 0.1)/(0.95 x 10) = 27.6—i.e., 28 schools.

This gives a total sample size of126 + 22 2 28 = 176 schools, rather than just 150,or 193 as calculated above. Before consideringschool and student non-response, the largerschools will yield a sample of 126 x 35 = 4 410students. The moderately small schools will givean initial sample of approximately 22 x 25 = 550students, and very small schools will give aninitial sample size of approximately 28 x 10 = 280students. The total initial sample size of studentsis therefore 4 410 + 550 + 280 = 5 240.

Assigning a Measure of Size to EachSchool

For the probability proportional to size samplingmethod used for PISA, a measure of size (MOS)derived from ENR was established for eachschool on the sampling frame. Where no explicitstratification of very small schools was requiredor if small schools (including very small schools)

were separately stratified because school size wasan explicit stratification variable and they didnot account for 5 per cent or more of the targetpopulation, NPMs were asked to construct theMOS as: .

The measure of size was therefore equal to theenrolment estimate, unless it was less than theTCS, in which case it was set equal to the targetcluster size. In most countries, so thatthe MOS was equal to ENR or 35, whicheverwas larger.

When an explicit stratum of very smallschools (schools with ENR less than or equal toTCS/2) was required, the MOS for each verysmall school was set equal to TCS/2. As sampleschools were selected according to their size (PPS),setting the measure of size of small schools to 35is equivalent to drawing a simple, randomsample of small schools.

School Sample Selection

Sorting the sampling frame

The Sampling Manual indicated that, prior toselecting schools from the school samplingframe, schools in each explicit stratum were tobe sorted by variables chosen for implicitstratification and finally by the ENR valuewithin each implicit stratum. The schools werefirst to be sorted by the first implicitstratification variable, then by the secondimplicit stratification variable within the levels ofthe first sorting variable, and so on, until allimplicit stratification variables were exhausted.This gave a cross-classification structure of cells,where each cell represented one implicit stratumon the school sampling frame. NPMs were toalternate the sort order between implicit strata,from high to low and then low to high, etc.,through all implicit strata within an explicitstratum.

School sample allocation over explicit strataThe total number of schools to be sampled ineach country needed to be allocated among theexplicit strata so that the expected proportion ofstudents in the sample from each explicitstratum was approximately the same as thepopulation proportions of eligible students ineach corresponding explicit stratum. There weretwo exceptions. If an explicit stratum of verysmall schools was required, students in them had

TCS = 35

MOS ENR TCS= ( )max ,

PIS

A 2

00

0 t

ech

nic

al r

epor

t

48

smaller percentages in the sample than those inthe population. To compensate for the resultingloss of sample, the large school strata hadslightly higher percentages in the sample thanthe corresponding population percentages. Theother exception occurred if only one school wasallocated to any explicit stratum. In these cases,NPMs were requested to allocate two schoolsfor selection in the stratum.

Determining which schools to sampleThe Probability Proportional to Size (PPS)systematic sampling method used in PISA firstrequired the computation of a sampling intervalfor each explicit stratum. This calculationinvolved the following steps:• recording the total measure of size, S, for all

schools in the sampling frame for eachspecified explicit stratum;

• recording the number of schools, D, to besampled from the specified explicit stratum,which was the number allocated to theexplicit stratum;

• calculating the sampling interval, I, asfollows: I = S/D; and

• recording the sampling interval, I, to fourdecimal places.Next, a random number (drawn from a

uniform distribution) had to be selected for eachexplicit stratum. The generated random number(RN) was to be a number between 0 and 1 andwas to be recorded to four decimal places.

The next step in the PPS selection method ineach explicit stratum was to calculate selectionnumbers—one for each of the D schools to beselected in the explicit stratum. Selectionnumbers were obtained using the followingmethod:• obtaining the first selection number by

multiplying the sampling interval, I, by therandom number, RN. This first selectionnumber was used to identify the first sampledschool in the specified explicit stratum;

• obtaining the second selection number bysimply adding the sampling interval, I, to thefirst selection number. The second selectionnumber was used to identify the secondsampled school; and

• continuing to add the sampling interval, I, tothe previous selection number to obtain thenext selection number. This was done until allspecified line numbers (1 through D) had been

assigned a selection number.Thus, the first selection number in an explicit

stratum was RN x I, the second selectionnumber was (RN x I) + I, the third selectionnumber was (RN x I) + I + I, and so on.

Selection numbers were generated independentlyfor each explicit stratum, with a new randomnumber selected for each explicit stratum.

Identifying the sampled schoolsThe next task was to compile a cumulativemeasure of size in each explicit stratum of theschool sampling frame that determined whichschools were to be sampled. Sampled schoolswere identified as follows.

Let Z denote the first selection number for aparticular explicit stratum. It was necessary tofind the first school in the sampling frame wherethe cumulative MOS equalled or exceeded Z.This was the first sampled school. In otherwords, if Cs was the cumulative MOS of aparticular school S in the sampling frame andC(s-1) was the cumulative MOS of the schoolimmediately preceding it, then the school inquestion was selected if: Cs was greater than orequal to Z, and C(s-1) was strictly less than Z.Applying this rule to all selection numbers for agiven explicit stratum generated the originalsample of schools for that stratum.

Identifying replacement schoolsEach sampled school in the main survey wasassigned two replacement schools from thesampling frame, identified as follows. For eachsampled school, the schools immediatelypreceding and following it in the explicit stratumwere designated as its replacement schools. Theschool immediately following the sampled schoolwas designated as the first replacement andlabelled R1, while the school immediatelypreceding the sampled school was designated asthe second replacement and labelled R2. TheSampling Manual noted that in small countries,there could be problems when trying to identifytwo replacement schools for each sampledschool. In such cases, a replacement school wasallowed to be the potential replacement for twosampled schools (a first replacement for thepreceding school, and a second replacement forthe following school), but an actual replacementfor only one school. Additionally, it may havebeen difficult to assign replacement schools for

Op

erat

ion

s

49

some very large sampled schools because thesampled schools appeared very close to eachother in the sampling frame. There were timeswhen NPMs were only able to assign a singlereplacement school, or even none, when twoconsecutive schools in the sampling frame weresampled.

Exceptions were allowed if a sampled schoolhappened to be the last school listed in anexplicit stratum. In this case the two schoolsimmediately preceding it were designated asreplacement schools, or the first school listed inan explicit stratum, in which case the twoschools immediately following it were designatedreplacement schools.

Assigning school identifiersTo keep track of sampled and replacementschools in the PISA database, each was assigneda unique, three-digit school code and two-digitstratum code (corresponding to the explicitstrata) sequentially numbered starting with onewithin each explicit stratum. For example, if 150 schools are sampled from a single explicitstratum, they are assigned identifiers from 001 to 150. First replacement schools in themain survey are assigned the school identifier oftheir corresponding sampled schools, incrementedby 200. For example, the first replacementschool for sampled school 023 is assigned schoolidentifier 223. Second replacement schools in themain survey are assigned the school identifier oftheir corresponding sampled schools, butincremented by 400. For example, the secondreplacement school for sampled school 136 tookthe school identifier 536.

Tracking sampled schoolsNPMs were encouraged to make every effort toconfirm the participation of as many sampledschools as possible to minimise the potential fornon-response biases. They contacted replacementschools after all contacts with sampled schoolswere made. Each sampled school that did notparticipate was replaced if possible. If both anoriginal school and a replacement participated,only the data from the original school wereincluded in the weighted data.

Monitoring the School SampleAll countries had the option of having theirschool sample selected by the Consortium.Belgium (Fl.), Belgium (Fr.), Latvia, Norway,Portugal and Sweden took this option. Theywere required to submit sampling forms 1 (timeof testing and age definition), 2 (national desiredtarget population), 3 (national defined targetpopulation), 4 (sampling frame description), 5(excluded schools), 7 (stratification), 11 (schoolsampling frame) and 12 (school tracking form).The Consortium completed and returned theothers (forms 6, 8, 9 and 10).

For each country drawing its own sample, thetarget population definition, the levels ofexclusions, the school sampling procedures, andthe school participation rates were monitored by12 sampling forms submitted to the Consortiumfor review and approval. Table 16 provides asummary of the information required on eachform and the timetables (which depended onnational assessment periods). (See Appendix 9for copies of the sampling forms.)

Once received from each country, each formwas reviewed and feedback was provided to thecountry. Forms were only approved after allcriteria were met. Approval of deviations wasonly given after discussion and agreement by theConsortium. In cases where approval could notbe granted, countries were asked to makerevisions to their sample design and samplingforms.

Checks that were performed in the monitoringof each form follow. All entries were observed intheir own right but those below are additionalmatters explicitly examined.

Sampling Form 1: Time of Testing and AgeDefinition• Assessment dates had to be appropriate for

the selected target population dates.• Assessment dates could not cover more than a

42-day period.

Sampling Form 2: National Desired TargetPopulation

• Large deviations between the total nationalnumber of 15-year-olds and the enrollednumber of 15-year-olds were questioned.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

50

Op

erat

ion

s

51

Table 16: Schedule of School Sampling Activities

Activity Sampling Form Submitted Due Dateto Consortium

Specify time of testing 1 - Time of Testing and Age At least six months before the beginning and age definition of Definition of testing and at least two months population to be tested before planned school sample selection.

Define National Desired 2 - National Desired Target Two months prior to the date of planned Target Population Population selection of the school sample.

Define National Defined 3 - National Defined Target Two months prior to the date of planned Target Population Population selection of the school sample.

Create and describe 4 - Sampling Frame Description One month before final approval of the sampling frame school sample is required, or two

months before data collection is tobegin, whichever is the earlier.

Decide on schools to 5 - Excluded Schools One month before final approval of the be excluded from school sample is required, or twosampling frame months before data collection is to

begin, whichever is the earlier.

Decide how to treat 6 - Treatment of Small Schools One month before final approval of thesmall schools school sample is required, or two


Decide on explicit and 7 - Stratification One month before final approval of theimplicit stratification school sample is required, or two variables months before data collection is to

begin, whichever is the earlier.

Describe population 8 - Population Counts by Strata One month before final approval of thewithin strata school sample is required, or two


Allocate sample over 9 - Sample Allocation by One month before final approval of theexplicit strata Explicit Strata school sample is required, or two


Select the school sample 10 - School Sample Selection One month before final approval of theschool sample is required, or twomonths before data collection is tobegin, whichever is the earlier.

Identify sampled schools, 11 - School Sampling Frame One month before final approval of the replacement schools and school sample is required, or two assign PISA school months before data collection is toidentification numbers begin, whichever is the earlier.

Create a school 12 - School Tracking Form Within one month of the end of the data tracking form collection period.

• Any population to be omitted from theinternational desired population was notedand discussed, especially if the percentage of15-year-olds to be excluded was more than 2 per cent.

• Calculations were verified.

Sampling Form 3: National Defined TargetPopulation

• The population figure in the first questionneeded to correspond with the finalpopulation figure on Sampling Form 2.

• Reasons for excluding schools were checkedfor appropriateness.

• The number and percentage of students to beexcluded at the school level and whether thepercentage was less than the maximum percent-age allowed for such exclusions were checked.

• Calculations were verified and the overallcoverage figures were assessed.

Sampling Form 4: Sampling Frame Description

• Special attention was paid to countries whoreported on this form that a three-stagesampling design was to be implemented andadditional information was sought fromcountries in such cases to ensure that the first-stage sampling was done adequately.

• The type of school-level enrolment estimateand the year of data availability were assessedfor reasonableness.

Sampling Form 5: Excluded Schools

• The number of schools and the total enrolmentfigures, as well as the reasons for exclusion,were checked to ensure correspondence withfigures reported on Sampling Form 3 aboutschool-level exclusions.

Sampling Form 6: Treatment of Small Schools

• Calculations were verified, as was the decisionabout whether or not a moderately smallschools stratum and/or a very small schoolsstratum were needed.

Sampling Form 7: Stratification

• Since explicit strata are formed to group likeschools together to reduce sampling varianceand to ensure appropriate representativenessof students in various school types, usingvariables that might have an effect onoutcomes, each country’s choice of explicit

PIS

A 2

00

0 t

ech

nic

al r

epor

t

52

stratification variables was assessed. If acountry was known to have school tracking,and tracks or school programmes were notamong the explicit stratifiers, a suggestionwas made to include this type of variable.

• If no implicit stratification variables werenoted, suggestions were made about ones thatmight be used.

• The sampling frame was checked to ensurethat the stratification variables were availablefor all schools. Different explicit strata wereallowed to have different implicit stratifiers.

Sampling Form 8: Population Counts by Strata

• Counts on Sampling Form 8 were comparedto counts arising from the frame. Anydifferences were queried and almost alwayscorrected.

Sampling Form 9: Sample Allocation byExplicit Strata

• All explicit strata had to be accounted for onSampling Form 9.

• All explicit strata population entries werecompared to those determined from thesampling frame.

• The calculations for school allocation werechecked to ensure that schools were allocatedto explicit strata based on explicit stratumstudent percentages and not explicit stratumschool percentages.

• The percentage of students in the sample foreach explicit stratum had to be close to thepercentage in the population for each stratum(very small schools strata were an exceptionsince under-sampling was allowed).

• The overall number of schools to be sampledwas checked to ensure that at least 150schools would be sampled.

• The overall number of students to be sampledwas checked to ensure that at least 5 250students would be sampled.

Sampling Form 10: School Sample Selection

• All calculations were verified.• Particular attention was paid to the four

decimal places that were required for both thesampling interval and the random number.

Sampling Form 11: School Sampling Frame

• The frame was checked for proper sortingaccording to the implicit stratification scheme

and enrolment values, and the properassignment of the measure of size value,especially for moderately small and very smallschools. The accumulation of the measure ofsize values was also checked for each explicitstratum. This final cumulated measure of sizevalue for each stratum had to correspond tothe ‘Total Measure of Size’ value on SamplingForm 10 for each explicit stratum.Additionally, each line selection number waschecked against the frame cumulative measureof size figures to ensure that the correct schoolswere sampled. Finally, the assignment ofreplacement schools and PISA identificationnumbers were checked to ensure that all ruleslaid out in the Sampling Manual were adheredto. Any deviations were discussed with eachcountry and either corrected or the deviationsaccepted.

Sampling Form 12: School Tracking Form

• Sampling Form 12 was checked to see that thePISA identification numbers on this formmatched those on the sampling frame.

• Checks were made to ensure that all sampledand replacement schools were accounted for.

• Checks were also made to ensure that statusentries were in the requested format.

Student Samples

Student selection procedures in the main studywere the same as those used in the field trial.Student sampling was generally undertaken atthe national centres from lists of all eligiblestudents in each school that had agreed toparticipate. These lists could have been preparedat national, regional, or local levels as data files,computer-generated listings, or by hand,depending on who had the most accurateinformation. Since it was very important that thestudent sample be selected from accurate,complete lists, the lists needed to be preparednot too far in advance of the testing and had tolist all eligible students. It was suggested that thelists be received one to two months beforetesting so that the NPM would have the time toselect the student samples.

Some countries chose student samples that in-cluded students aged 15 and/or enrolled in a specificgrade (e.g., grade 10). Thus, a larger overall sample,including 15-year-old students and students in

the designated grade (who may or may not havebeen aged 15) were selected. The necessary stepsin selecting larger samples are highlighted whereappropriate in the following steps. Only Iceland,Japan, Hungary and Switzerland selected gradesamples and none used the standard methoddescribed here. For Iceland and Japan, the samplewas called a grade sample because over 99.5 percent of the PISA eligible 15-year-olds were in thegrade sampled. Switzerland supplemented thestandard method with additional schools whereonly a sample of grade-eligible students (15-year-olds plus others) was selected. Hungary selecteda grade 10 classroom from each PISA school,independent of the PISA student sample.

Preparing a list of age-eligible studentsAppendix 10 shows an example Student ListingForm as well as school instructions about howto prepare the lists. Each school drawing anadditional grade sample was to prepare a list ofage and grade-eligible students that included allstudents in the designated grade (e.g., grade 10);and all other 15-year-old students (using theappropriate 12-month age span agreed upon foreach country) currently enrolled in other grades.

NPMs were to use the Student Listing Formas shown in the Appendix 10 example but coulddevelop their own instructions. The followingwere considered important:• Age-eligible students were all students born in

1984 (or the appropriate 12-month age spanagreed upon for the country).

• The list was to include students who mightnot be tested due to a disability or limitedlanguage proficiency.

• Students who could not be tested were to beexcluded from the assessment after thestudent sample was selected.

• It was suggested that schools retain a copy ofthe list in case the NPM had to call the schoolwith questions.

• A computer list was to be up-to-date at thetime of sampling rather than prepared at thebeginning of the school year.Students were identified by their unique

student identification numbers.

Selecting the student sampleOnce NPMs received the list of eligible studentsfrom a school, the student sample was to beselected and the list of selected students (i.e., the

Op

erat

ion

s

53

PIS

A 2

00

0 t

ech

nic

al r

epor

t

54

Student Tracking Form) returned to the school.NPMs were encouraged to use KeyQuest®, thePISA sampling software, to select the studentsamples. Alternatively, they were to undertakethe steps that follow.

Verifying student listsStudent lists were checked to confirm a 1984 (orwithin the agreed-upon time span) year of birthor grade specified by the NPM (if grade samplingtook place) and when this was not the case, thestudents name was crossed out. The schools werecontacted regarding any questions or discrepancies.Students not born in 1984 (or within the agreed-upon time span, or grade if grade sampling wasto be done) were to be deleted from the list (afterconfirming with the school). Duplicates weredeleted from the list before sampling.

Numbering students on the listUsing the line number column on the StudentListing Form (or the margin if using anotherlist), eligible students on the list were numberedconsecutively for each school. The numberingwas checked for repetition or ellipses, andcorrected before the student sample was selected.

Computing the sampling interval and random start

After verifying the list, the total number of 15-year-old students (born in 1984 or the agreed-upon time span) was recorded in Box A on theStudent Tracking Form (see Appendix 11). Thiswould be the same number as the last linenumber on the list, if no grade sampling werebeing done. The total number of listed studentswas entered in Box B. The desired sample size(usually 35) was generally the same for eachschool and was entered in Box C on the StudentTracking Form. A random number was chosenfrom the RN table and recorded as a four-digitdecimal (e.g., 3 279 = 0.3279) in Box D.

Whether or not grade sampling was being done,the sampling interval was computed by dividingthe total number of students aged 15 (Box A) bythe number in Box C, the planned student samplesize for each school, and recorded in Box E. If thenumber was less than 1.0000, a 1 was recorded. Ifgrade sampling was being done, this interval wasapplied to the larger list (grade and age-eligiblestudents) to generate a larger sample.

A random start point was computed by

multiplying the RN decimal by the samplinginterval plus one, and the result was recorded inBox F. The integer portion of this value was theline number of the first student to be selected.

Determining line numbers and selectingstudents

The sampling interval (Box E) was added to thevalue in Box F to determine the line numbers ofthe students to be sampled until the largest linenumber was equal to or larger than the totalnumber of students listed (Box B). Each linenumber selected was to be recorded in wholeintegers. Column (2) on the Student TrackingForm was used to record selected line numbersin sequence. NPMs were instructed to write an Sin front of the line number on the originalstudent list for each student selected. If a gradesample was being selected, the total studentsample was at least as large as (probablysomewhat larger than) the number in Box C,and included some age-eligible (about equal tothe number in Box C) and grade-eligiblestudents.

Preparing the Student Tracking FormOnce the student sample was selected, theStudent Tracking Form for each school could becompleted by the NPM. The Student TrackingForm was the central administration documentfor the study. When a Student Tracking Formwas sent to a school, it served as the completelist of the student sample. Once booklets wereassigned to students, the Student Tracking Formbecame the link between the students, the testbooklets, and background questionnaires thatthey received. After the testing, the results of thesession (in terms of presence, absence, exclusion,etc.) were entered on the Student Tracking Formand summarised. The Student Tracking Formwas sent back to the national centre with the testinstruments and was used to make sure that allmaterials were accounted for correctly.

The Student Tracking Form was completed asfollows:• The header of the Tracking Form was

prepared with the country name, the schoolname and the school identifier.

• Boxes A to F were checked for completion.These boxes provided the documentation forthe student sampling process.

• The names of the selected students were listed

in column (3) of the Student Tracking Form.The line number for each selected student,from the Listing Form, had already beenentered in column (2). The pre-printed studentnumber in column (1) was the student’s new‘ID’ number which became part of a uniquenumber used to identify the studentthroughout the assessment process.The Student Tracking Form, which consisted

of two sides, was designed to accommodatestudent samples of up to 35 students, which wasthe standard PISA design. If the student samplewas larger than 35, NPMs were instructed to usemore than one Student Tracking Form, and onthe additional forms, to renumber the pre-printed identification numbers in column (1) tocorrespond to the additional students. Forexample, column (1) of the second form wouldbegin with 36, 37, 38, etc.• For each sampled student, the student’s grade,

date of birth and sex were entered on theStudent Tracking Form. It was possible forNPMs to modify the form to include otherdemographic variables.

• The ‘Excluded’ column was to be left blankby the NPM. This column was to be used bythe school to designate any students whocould not be tested due to a disability orlimited language proficiency.

• If booklet numbers were to be assigned tostudents at the national centre, the bookletnumber assigned to each student was enteredin column (8). Participation status on theStudent Tracking Form was to be left blank atthe time of student sampling. Thisinformation would only be entered at theschool when the assessment was administered.

Preparing instructions for excluding studentsPISA was a timed assessment administered in theinstructional language(s) of each country anddesigned to be as inclusive as possible. Forstudents with limited language proficiency orwith physical, mental, or emotional disabilitieswho could not participate, PISA developedinstructions in cases of doubt about whether aselected student should be assessed.

NPMs used the guidelines in Figure 5 todevelop instructions; School Co-ordinators andTest Administrators7 needed precise instructionsfor exclusions. The national operationaldefinitions for within-school exclusions were to

Op

erat

ion

s

55

7 The roles of School Co-ordinators and TestAdministrators are described in detail in Chapter 6.

be well documented and submitted to theConsortium for review before testing.

Sending the Student Tracking Form to theSchool Co-ordinator and Test Administrator

The School Co-ordinator needed to knowwhich students were sampled in order to notifythem and their teachers (and parents), to updateinformation and to identify the students to beexcluded. The Student Tracking Form andGuidelines for Excluding Students were thereforesent about two weeks before the assessmentsession. It was recommended that a copy of theTracking Form be made and kept at the nationalcentre. Another recommendation was to havethe NPM send a copy of the form to the TestAdministrator with the assessment booklets andquestionnaires in case the school copy wasmisplaced before the assessment day. The TestAdministrator and School Co-ordinator manuals(see Chapter 6) both assumed that each wouldhave a copy.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

56

INSTRUCTIONS FOR EXCLUDING STUDENTS

The following guidelines define general categories for the exclusion of students withinschools. These guidelines need to be carefully implemented within the context of eacheducational system. The numbers to the left are codes to be entered in column (7) of theStudent Tracking Form to identify excluded students.

1 = Functionally disabled students. These are students who are permanently physicallydisabled in such a way that they cannot perform in the PISA testing situation.Functionally disabled students who can respond to the test should be included in thetesting.

2 = Educable mentally retarded students. These are students who are considered in theprofessional opinion of the school principal or by other qualified staff member to beeducable mentally retarded or who have been psychologically tested as such. Thisincludes students who are emotionally or mentally unable to follow even the generalinstructions of the test. However, students should not be excluded solely because ofpoor academic performance or disciplinary problems.

3 = Students with limited proficiency in the test language. These are students who areunable to read or speak the language of the test and would be unable to overcomethe language barrier in the test situation. Typically, a student who has received lessthan one year of instruction in the language of the test should be excluded, but thisdefinition may need to be adapted in different countries.

4 = Other.

It is important that these criteria be followed strictly for the study to be comparablewithin and across countries. When in doubt, the student was included.

Figure 5: Student Exclusion Criteria as Conveyed to National Project Managers

Translation and CulturalAppropriateness of the Testand Survey Material

Aletta Grisay

Introduction

Translation errors are known to be a majorcause for items to function poorly in internationaltests. They are much more frequent than otherproblems, such as clearly identified discrepanciesdue to cultural biases or curricular differences.

If a survey is done merely to rank countries orstudents, this problem can be avoided somewhatsince once the most unstable items have beenidentified and dropped, the few remainingproblematic items are unlikely to affect the overallestimate of a country’s mean in any significant way.

The aim of PISA, however, is to developdescriptive scales, and in this case translationerrors are of greater concern. Theirinterpretation can be severely biased by unstableitem characteristics from one country to another.PISA has therefore implemented stricterverification procedures for translationequivalence than those used in many priorsurveys. This chapter describes these proceduresand their implementation and outcomes.

A number of quality assurance procedureswere implemented in the PISA 2000 assessmentto ensure equivalent national test andquestionnaire materials. These included:• providing two parallel source versions (in

English and French), and recommending thateach country develop two independentversions in their instruction language (onefrom each source language), then reconcilethem into one national version;

• systematically adding information to the testand questionnaire materials to be translatedabout Question Intent, to clarify the scope

and characteristics, and appending frequentTranslation Notes for possible translation oradaptation problems;

• developing detailed translation/adaptationguidelines for the test material, and forrevising it after the field trial, as an importantpart of the PISA National Project ManagerManual;

• training key staff from each national team onrecommended translation procedures; and

• appointing and training a group ofinternational verifiers (professional translatorsproficient in English and French, with nativecommand of each target language), to verifythe national versions against the sourceversions.

Double Translation fromTwo Source Languages

A back translation procedure is the mostfrequently used to ensure linguistic equivalenceof test instruments in international surveys. Itrequires translating the source version of the test(generally English language) into the nationallanguages, then translating them back to andcomparing them with the source language toidentify possible discrepancies.

This technique is relatively effective fordetecting mistranslation or major interpretationproblems (Hambleton et al., in press). Forexample, in the original English version of oneof the PISA reading texts proposed for the fieldtrial, no lexical or grammatical clue enabled thereader to identify the main character’s gender.

Chapter 5

Many languages imposed a choice in almostevery sentence in which the character occurs.The comparison of several back translationsalmost inevitably brings out this type ofproblem.

Back translation has a serious deficiency,however, which has often been pointed out. Inmany cases, the translation is incorrect becauseit is too literally transposed, but there is a fairlyhigh risk that the back translation would merelyrecover the original text without revealing theerror.

Somerset Maugham’s short story The Ant andthe Grasshopper was initially proposed as areading text for the PISA 2000 field trial. Bothtranslators who worked on the French sourceversion translated the content of the underlinedsentence in the following passage word forword:

‘In this admirable fable (I apologise fortelling something which everyone is politely,but inexactly, supposed to know) the antspends a laborious summer gathering itswinter store, while the grasshopper sits on ablade of grass singing to the sun.’

Translation 1:«Dans cette fable remarquable (que le

lecteur me pardonne si je raconte quelquechose que chacun est courtoisement censésavoir, mais pas exactement), la fourmiconsacre un été industrieux à rassembler desprovisions pour l’hiver, tandis que la cigale lepasse sur quelque brin d’herbe, à chanter ausoleil. »

Translation 2:«Dans cette fable admirable (veuillez

m’excuser de rappeler quelque chose quechacun est supposé connaître par politessemais pas précisément), la fourmi passe un étélaborieux à constituer des réserves pourl’hiver, tandis que la cigale s’installe surl’herbe et chante au soleil. »

Both translations are literally correct, andwould back-translate into an English sentencequite parallel to the original sentence. However,both are semantically unacceptable, since neitherreflects the irony of ‘politely, but inexactlysupposed to know’. The versions were reconciledand eventually translated into:

« Dans cette fable remarquable (que l’on mepardonne si je raconte quelque chose quechacun est supposé connaître – suppositionqui relève de la courtoisie plus que del’exactitude), la fourmi consacre un étéindustrieux à rassembler des provisions pourl’hiver, tandis que la cigale le passe surquelque brindille, à chanter au soleil. »

Both translations 1 and 2 would probablyappear to be correct when back-translated,whereas the reconciled version would appearfurther from the original.

It is also interesting to note that both Frenchtranslators and the French reconciler preferred(rightly so, for a French-speaking audience) torevert to the cigale [i.e., cicada] of La Fontaine’soriginal text, while Somerset Maugham adaptedthe fable to his English-speaking audience byreferring to a grasshopper [which would havebeen sauterelle in French]. However, bothtranslators neglected to draw the entomologicalconsequences from their return to the original:they were too faithful to the English text andallowed a strictly arboreal insect [the cicada] tolive on a brin d’herbe [i.e., a blade of grass], thatthe reconciler replaced by a brindille [i.e., atwig]. In such a case, the back translationprocedure would consider cigale and brindille asdeviations or errors.

A double translation procedure (i.e., twoindependent translations from the sourcelanguage, and reconciliation by a third person)offers significant advantages in comparison withthe back translation procedure:• Equivalence of the source and target

languages is obtained by using three differentpeople (two translators and a reconciler) whoall work on the source and the target versions.In the back translation procedure, by contrast,the first translator is the only one to focussimultaneously on the source and targetversions.

• Discrepancies are recorded directly in thetarget language instead of in the sourcelanguage, as would be the case in a backtranslation procedure.These examples are deliberately borderline

cases, where both translators happened to makea common error (that could theoretically havebeen overlooked, had the reconciliation havebeen less accurate). But the probability of

PIS

A 2

00

0 t

ech

nic

al r

epor

t

58

detecting errors is obviously considerably higherwhen three people rather than one compare thesource language with the target language.

A double translation procedure was used inthe Third International Mathematics and ScienceStudy-Repeat (TIMSS-R) instead of the backtranslation procedure used in earlier studies bythe International Association for the Evaluationof Educational Achievement (IEA).

PISA used double translation from twodifferent languages because both backtranslation and double translation proceduresfall short in that the equivalence of the variousnational versions depends exclusively on theirconsistency with a single source version (ingeneral, English). This leads to implicitly givingmore weight than would be desirable on thosecultural forms related to the reference language.

Furthermore, one would wish for as purely asemantic equivalence as possible (since theprinciple is to measure access that students fromdifferent countries would have to a samemeaning, through written material presented indifferent languages). However, using a singlereference language is likely to give moreimportance than would be desirable to theformal characteristics of that language. If asingle source language is used, its lexical andsyntactic features, stylistic conventions andtypical organisational patterns of ideas withinthe sentence will have more impact thandesirable on the target language versions.

Other expected advantages from using twosource languages included:• The verification of the equivalence between

the source and the target versions wasperformed by four different people who allworked directly on the texts in the relevantnational languages (i.e., two translators, onenational reconciler and the Consortium’sverifier).

• The degree of freedom to take with respect toa source text often causes translationproblems. A translation that is too faithfulmay appear awkward; if it is too free or tooliterary it is very likely to fail to be equivalent.Having two source versions in differentlanguages (for which the translationfidelity/freedom has been carefully calibratedand approved by Consortium experts) providesbenchmarks for a national reconciler that arefar more accurate in this respect, that neither

Tra

nsl

atio

n a

nd

cu

ltu

ral

app

rop

riat

enes

s of

th

e te

st a

nd

su

rvey

mat

eria

l

59

back translation nor double translation froma single language could provide.

• Many translation problems are due toidiosyncrasies: words, idioms, or syntacticstructures in one language appearuntranslatable into a target language. In manycases, the opportunity to consult the othersource version will provide hints at solutions.

• Similarly, resorting to two different languageswill, to a certain extent, tone down problemslinked to the impact of cultural characteristicsof a single source language. Admittedly, bothlanguages used here share an Indo-Europeanorigin, which may be regrettable in thisparticular case. However, they do representsets of relatively different cultural traditions,and are both spoken in several countries withdifferent geographic locations, traditions,social structures and cultures.Nevertheless, as all major international

surveys prior to PISA always used English astheir source language, empirical evidence toinform on the consequences of using analternative reference language was lacking. Asfar as we know, the only interesting findings inthis respect were reported in the IEA/ReadingComprehension survey (Thorndike, 1973),which showed a better item coherence (factorialstructure of the tests, distribution of thediscrimination coefficients) between English-speaking countries than across otherparticipating countries. In this perspective, thedouble source procedure used in PISA was quiteinnovative.

Development of SourceVersions

Participating countries contributed most ofthe text passages used as stimuli in the PISA fieldtrial and some of the items: the material retainedin the field trial contained submissions from18 countries.1 An alternative source version was

1 Other than material sourced from IALS, 45 unitswere included in the field trial, each with its own setof stimulus materials. Of these, 27 sets of stimuli werefrom five different English-speaking countries while18 came from countries with 12 other languages(French 4, Finnish 3, German 2, Danish 1, Greek 1,Italian 1, Japanese 1, Korean 1, Norwegian 1, Russian1, Spanish 1 and Swedish 1) and were translated intoEnglish to prepare the English source version, whichwas then used to prepare the French version.

developed mainly through double translating anessentially English first source version of thematerial and reconciling it into French.

A group of French domain experts, chaired byMrs. Martine Rémond (one of the PISA ReadingFunctional Expert Group members), wasappointed to review the French source version forequivalence with the English version, forlinguistic correctness and for the appropriatenessof the terminology used (particularly formathematics and science material).

Time constraints led to starting the processbefore the English version was finalised. This putsome pressure on the French translators, whohad to carefully follow all changes made to theEnglish originals. It did however make itpossible for them to detect a few residual errorsoverlooked by the test developers, and toanticipate potential translation problems. Inparticular, a number of ambiguities or pitfallexpressions could be spotted and avoided fromthe beginning by slightly modifying the sourceversions, and the list of aspects requiringnational adaptations could be refined, andfurther translation notes could be added when aneed was identified. In this respect, thedevelopment of the French source version servedas a translation trial, and probably helpedprovide National Project Managers (NPMs) withsource material that was somewhat easier totranslate or contained fewer potential translationtraps than it would have had if a single sourcehad been developed.

Two additional features were embedded in bothsource versions to help translators. Translationnotes were added wherever the test developersthought it necessary to draw the translator’sattention to some important aspect, e.g.:• Imitate as closely as possible some stylistic

characteristic of the source version.• Indicate particular cases where the wording of

a question must be the same (or NOT thesame) as in a specific sentence in the stimulus;

• Alert to potential difficulties or translationtraps.

• Point out aspects for which a translator isexpressly asked to use national adaptations.

• A short description of the Question Intent wasprovided for each item, to help translatorsand test markers to better understand thecognitive process that the test developerswished to assess. This proved to be

particularly useful for the reading material,where the Question Intent descriptions oftencontained important information such aswhether a question:• required a literal match between the ques-

tion stem and the corresponding passage inthe text, or whether a synonym or paraphraseshould be provided in the question stem;

• required the student to make someinference or to find some implicit linkbetween the information given in variousplaces in the text. In such cases, forexample, the PISA Translation Guidelines(see next section) instructed the translatorsthat they should never add connectorslikely to facilitate the student’s task, such ashowever, because, on the other hand, bycomparison, etc., where the author did notput any (or the converse: never forget touse a connector if one had been used by theauthor); or

• was intended to assess the student’sperception of textual form or style. In suchcases, it was important that the translationconvey accurately such subtleties of thesource text as irony, word colour andnuances in character motives.

PISA TranslationGuidelines

NPMs also received a document with detailedinformation on: • PISA requirements in terms of necessary

national version(s). PISA takes as a generalprinciple that students should be tested in thelanguage of instruction used in their school.Therefore, the NPMs of multilingual countrieswere requested to develop as many versions ofthe test instruments as there were languages ofinstruction used in the schools included intheir national sample. Cases of minoritylanguages used in only a very limited numberof schools could be discussed with thesampling referee to decide whether suchschools could be excluded from the targetpopulation without affecting the overallquality of the data collection.

• Which parts of the materials had to be doubletranslated, or could be single translated.Double-translation was required for the

PIS

A 2

00

0 t

ech

nic

al r

epor

t

60

cognitive tests, questionnaires and for theoptional Computer Familiarity or InformationTechnology (IT) and Cross-CurriculumCompetency (CCC) instruments, but not forthe manuals and other logistic material.

• Instructions related to the recruitment oftranslators and reconcilers, their training, andthe scientific and technical support to beprovided to the translation team. It wassuggested, in particular, that translatedmaterial and national adaptations deemednecessary for inclusion be submitted forreview and approval to a national expertpanel composed of domain specialists.

• Description of the PISA translationprocedures. It was required that nationalversion(s) be developed by double translationand reconciliation with the source material. Itwas recommended that one independent trans-lator use the English source version and thatthe second use the French version. In countrieswhere the NPM had difficulty appointingcompetent French translators, doubletranslation from English only was consideredacceptable according the PISA standards.Other sections of the PISA Translation

Guidelines were more directly intended for useby the national translators and reconciler(s):• Recommendations to prevent common

translation traps a rather extensive sectiongiving detailed examples on problems frequentlyencountered when translating surveymaterials, and advice on how to avoid them;

• Instructions on how to use translation notesand descriptions of Question Intents includedin the material;

• Instructions on how to adapt the material tothe national context, listing a variety of rulesidentifying acceptable/unacceptable nationaladaptations;

• Special notes on translating mathematics andscience material;

• Special notes on translating questionnairesand manuals; and

• a National Adaptation Form, to documentnational adaptations included in the material.An additional section of the Guidelines was

circulated to NPMs after the field trial, togetherwith the revised materials to be used in the mainstudy, to help them and their translation teamrevise their national version(s).

Translation TrainingSession

NPMs received sample material to use forrecruiting national translators and training themat the national level. The NPM meeting held inNovember 1998, prior to the field trialtranslation activities, included a training sessionfor members of the national translation teams(or the person responsible for translationactivities) from the participating countries. Adetailed presentation was made of the material,of recommended translation procedures, of theTranslation Guidelines, and verification process.

InternationalVerification of theNational Versions

One of the most productive quality controlprocedures implemented in PISA to ensure highquality standards in the translated assessmentmaterials consisted in having a team ofindependent translators, appointed and trainedby the Consortium, verify each national versionagainst the English and French source versions.

Two Verification Co-ordination centres wereestablished. One was at the Australian Concilfor Educational Research (ACER) in Melbourne(for national adaptations used in the English-speaking countries and national versionsdeveloped by participating Asian countries). Thesecond one was at cApStAn, a translation firm inBrussels (for all other national versions,including the national adaptations used in theFrench-speaking countries). cApStAn had beeninvolved in preparing the French source versionof the PISA material, and was retained bothbecause of its familiarity with the study materialand because of its large network of internationaltranslators.

The Consortium undertook internationalverifications of all national versions in languagesused in schools attended by more than 5 per centof the country’s target population. For languagesused in schools attended by 5 per cent or lessminorities, international-level verification wasdeemed unnecessary since the impact on thecountry results would be negligible, andverification of very low frequency languages wasmore feasible at national level. In Spain, for

Tra

nsl

atio

n a

nd

cu

ltu

ral

app

rop

riat

enes

s of

th

e te

st a

nd

su

rvey

mat

eria

l

61

instance, the Consortium took up theverification of the Castilian and Catalan testmaterial (respectively 72 and 16 per cent of thetarget population), while the Spanish NationalCentre was responsible for verifying theirGalician, Valencian and Basque test material atnational level (respectively 3, 4 and 5 per cent ofthe target population).

For a few minority languages, nationalversions were only developed (and verified) inthe main study phase. This was consideredacceptable when a national centre had arrangedwith another PISA country to borrow its fieldtrial national version for their minority (adaptingthe Swedish version for the Swedish schools inFinland, the Russian version for Russian schoolsin Latvia), and when the minority language wasconsidered to be a dialect that differed onlyslightly from the main national language(Nynorsk in Norway).

A few English or French-speaking countries orcommunities (Canada (Fr.), Ireland, Switzerland(Fr.) and United Kingdom) were constrained byearly testing windows, and submitted onlyNational Adaptation Forms for verification. Thiswas also considered acceptable, since thesecountries used national versions that wereidentical to the source version except for thenational adaptations.

The main criteria used to recruit translators tolead the verification of the various nationalversions were that they were (or had):• native speakers of the target language;• experienced translators from either English or

French into their target language;• sufficient command of the second source

language (either English or French) to be ableto use it for cross-checks in the verification ofthe material; and

• experience as teachers and/or have highereducation degrees in psychology, sociology oreducation, as far as possible.All appointed verifiers met the first two

requirements; fewer than 10 per cent of them didnot have sufficient command of the Frenchlanguage to fully meet the third requirement;and two-thirds of the verifiers met the lastrequirement. The remaining one-third had higherdegrees in other fields, such as philosophy orpolitical and social sciences.

As a general rule, the same verifiers were usedfor homo-lingual versions (i.e., the various

national versions from English, French, German,Italian and Dutch-speaking countries orcommunities). However, the Portuguese languagediffers significantly from Brazil to Portugal, andthe Spanish language is not the same in Spainand in Mexico, so independent native translatorshad to be appointed for those four countries.

In a few cases, both in the field trial and themain study verification exercises, the timeconstraints were too tight for a single person tomeet the deadlines, and additional translatorshad to be appointed and trained.

Verifier training sessions were held inMelbourne and in Brussels, prior to theverification of both the field trial and the mainstudy material. Attendees received copies of thePISA information brochure, TranslationGuidelines, the English and French sourceversions of the material and a Verification CheckList developed by the Consortium. When madeavailable by the countries, a first bundle oftarget material was also delivered. The trainingsession focused on:• presenting verifiers with PISA objectives and

structure;• familiarising them with the material to be

verified;• reviewing and extensively discussing the

Translation Guidelines and the VerificationCheck List;

• checking that all verifiers who would work onelectronic files knew how to deal with someimportant Microsoft Word® commands (trackchanges, comments, edit captions in graphics,etc.) and (if hard copies only had beenprovided) that they knew how to use standardinternational proof-reading abbreviations andconventions;

• arranging for schedules and for dispatchlogistics; and

• insisting on security requirements.Verification for the field trial tests required on

average 80 to 100 hours per version. Somewhatless time was required for the main studyversions. Most countries submitted their materialin several successive shipments, sometimes witha large delay that was inconsistent with theinitially negotiated and agreed VerificationSchedule. The role of the verification co-ordinators proved to be particularly important inensuring continued and friendly contact with thenational teams, maintaining flexibility in the

PIS

A 2

00

0 t

ech

nic

al r

epor

t

62

availability of the verifiers, keeping records ofthe material being circulated, submitting doubtfulcases of too-free national adaptations to the testdevelopers, forwarding all last minute changesand edits introduced in the source versions bythe test developers to the verifiers, and archivingthe reviewed material and verification reportsreturned by the verifiers.

The PISA 2000 Field Trial led to theintroduction of a few changes in the main studyverification procedure:• Manuals to be used by the School Co-ordinator

and the Test Administrator should be includedin the material to be internationally verified inorder to help prevent possible deviations fromthe standard data collection procedures.

• Use of electronic files rather than hard copiesfor verifying the item pool and marking guideswas recommended in order to save time andto revise the verified material by the nationalteams more easily and more effectively.

• NPMs were asked to submit hard copies oftheir test booklets before sending them to theprinter, so that the verifiers could check thefinal layout and graphic format, and theinstructions given to the students. This alsoallowed them to check the extent to whichtheir suggestions had been implemented.

Translation andVerification ProcedureOutcomes

In an international study where reading was themajor domain, the equivalence of the nationalversions and the suitability of the proceduresemployed to obtain it was, predictably, the focusof many discussions both among the Board ofParticipating Countries (BPC) members andamong national research teams. The followingquestions in particular were raised.• Is equivalence across languages attainable at

all? Many linguistic features differ sofundamentally from language to language thatit may be unrealistic to expect equal difficultyof the reading material in all versions.

• In particular, the differences in the length oftranslated material from one language toanother are well known. In a time-limitedassessment, won’t students from certaincountries be at some disadvantage if all

stimuli and items are longer in their languagethan in others, requiring them to use moretime to read, and therefore decreasing thetime available for answering questions?

• Is double translation really preferable to otherwell-established methods such as back-translation?

• Is there a risk that using two source languageswill result in more rather than fewertranslation errors?

• What is the real value of such a costly andtime-consuming exercise as internationalverification? Wouldn’t the time and resourcesbe better spent on additional nationalverifications?

• Will the verifiers appointed by theConsortium be competent enough, whencompared to national translators chosen by anNPM and working under his or hersupervision?Concerns about the value of the international

verification were usually allayed when the NPMsstarted receiving their verified material.Although the quality of the national translationswas rather high (and, in some cases,outstanding) with only very few exceptions,verifiers still found many flaws in virtually allnational versions. The most frequent of themincluded:• Late corrections introduced in the source

versions of the material had been overlooked.• New mistakes had been introduced when

entering revisions (mainly typographicalerrors or subject-verb agreement errors).

• Graphics errors (keys, captions, textembedded in graphics). Access to a number ofthese appeared to be very difficult, partlybecause English words tend to be rather short,and the size of the text frames often lackedspace for longer formulations. The resultswere sometimes barely legible.

• Layout and presentation (frequent alignmentproblems and font or font size discrepancies).

• Line numbers (frequent changes in the layoutresulted in incorrect references to linenumbers).

• Incorrect quotations from the text, mostly dueto edits implemented in the stimulus withoutmaking them in question stems or scoringinstructions.

• Missing words or sentences.• Cases where a question contained a literal

Tra

nsl

atio

n a

nd

cu

ltu

ral

app

rop

riat

enes

s of

th

e te

st a

nd

su

rvey

mat

eria

l

63

match in the source versions but asynonymous match in the translated version—often due to the well-known translator’ssyndrome of avoiding double and triplerepetitions in a text.

• Inconsistencies in punctuation, instructions tostudents, ways of formulating the question,spelling of certain words, and so on.The verifiers’ reports also provided some

interesting indications of three types oftranslation errors that seemed to occur lessfrequently in national versions developedthrough double translation from the two sourceversions rather than through single translation ordouble translation from a single-source version.These included mistranslations; literal translationsof idioms or colloquial expressions (loantranslations); incorrect translation of questionswith stems such as ‘which of the following…’into national expressions meaning ‘which one ofthe following…’. A cross-check with the Frenchsource version (‘Laquelle ou lesquelles despropositions suivantes…’) would have alertedthe translator to the fact that the student wasexpected to provide one or more answers.

Based on the data collected in the field trial, afew empirical analyses could be made to exploresome of the other issues raised by the BPCmembers and the NPMs focused on twoquestions. To what extent could the English andFrench source versions be considered‘equivalent’? Did the recommended procedureactually result in better national versions thanother procedures used by some of theparticipating countries?

To explore these two issues, the followinganalyses were performed:• Since reading was the major domain in the

PISA 2000 assessment, it was particularlycrucial that the stimuli used in the test unitsbe as equivalent as possible in terms oflinguistic difficulty. Length, and a fewindicators of lexical and syntactic difficulty ofthe English and French versions of a sampleof PISA stimuli, were compared, usingreadability formulae to assess their relativedifficulty in both languages.

• The field trial data from the English andFrench-speaking countries or communitieswere used to check whether the psychometriccharacteristics of the items in the versionsadapted from the English source were similar

PIS

A 2

00

0 t

ech

nic

al r

epor

t

64

to those in the versions adapted from the Frenchsource. In particular, it was important to knowwhether any items showed flaws in all or mostof the French-speaking countries or com-munities, but none in the English-speakingcountries or communities, or vice versa.

• Based on the field trial statistics, the nationalversions developed through the recommendedprocedure were compared with those obtainedthrough alternative procedures to identify thetranslation methods that produced fewestflawed items.

Linguistic Characteristics of theEnglish and French Source Versionsof the Stimuli

Differences in length

The length of stimuli was compared using thetexts included in 50 reading and seven science orIntegrated units.2 No mathematics units wereincluded, since they had no or very short textstimuli. Some of the reading and science unitswere also excluded, since the texts were veryshort or they had only tables.

The French version of the stimuli proved to besignificantly longer than the English version.There were, on average, 12 per cent more wordsin French (410 words in the French stimuli onaverage compared to 367 in English). In addition,the average length of words is greater in French(5.09 characters per word, versus 4.83 charactersper word in English). So the total charactercount was considerably higher in French (onaverage, 18.84 per cent more characters in theFrench than in the English version of the sampledstimuli). This did not affect the relative length ofthe passages in the two languages, however. Thecorrelation between both the French and Englishword counts and between the French andEnglish character count was 0.99.

Some variation was observed from text to textin the increase in length from English to French.There was some evidence that texts that wereoriginally in French or in languages other thanEnglish tended to have fewer words (Pole Sud:2 per cent fewer; Shining Object: 4 per centfewer; Macondo: 5 per cent fewer) or to have

2 Integrated units were explored in the field trial, butthey were not used in the main study.

only minor differences (Police: 2.5 per centmore; Amanda and the Duchess: 1.2 per centmore; Rhinoceros: 5 per cent more; Just Judge: 1.6 per cent more; Corn: 4 per cent more).

Differences in text length and item difficultyForty-nine prose texts of sufficient length (morethan 150 words in the English version) wereretained for an analysis on the possible effects onitem difficulty of the higher word countassociated with translation into French. For tenof them, the French version was shorter than theEnglish version or very similar in length (lessthan a 5 per cent increase in the word count).Ten others had a 6 to 10 per cent increase in theFrench word count. Fifteen had an increase from11 to 20 per cent, and the remaining 14 had anincrease of more than 20 per cent.

Eleven PISA countries had English or Frenchas (one of) their instruction language(s). TheEnglish source version of the field trialinstruments was used, with a few nationaladaptations, in Australia, Canada (Eng.),Ireland, New Zealand, United Kingdom andUnited States. The French source version wasused, with a few national adaptations, inBelgium (Fr.), Canada (Fr.), France, Luxembourgand Switzerland (Fr.).

The booklets used in Luxembourg, however,contained both French and German material,which made it impossible to use the statistics

from Luxembourg for a comparison between theFrench and English data. Therefore the analysiswas done using the item statistics from theremaining 10 countries: six countries in theAdaptation from English group, and four in theAdaptation from French group.

In the field trial, the students answered a totalof 277 questions related to the selected stimuli (five or six questions per text). The overallpercentage of correct answers to the subset ofquestions related to each text was computed foreach country in each language group. Table 17shows the average results by type of stimuli.

In both English and French-speaking countries,the item difficulty appeared to be higher in thegroup of test units where the French stimuli wereonly slightly longer than the English version,whereas, when the French stimuli weresignificantly longer, it proved to be easier.

The mean per cent of correct answers wasslightly higher in the English-speaking countriesor communities for all groups of test units, butmore so for the groups of units containing thestimuli with the largest increase in length in theFrench version. This group-by-languageinteraction was significant (F=3.62, p<0.5),indicating that the longer French units tended tobe (proportionally) more difficult for French-speaking students than those with only slightlygreater word count.

65

Table 17: Percentage of Correct Answers in English and French-Speaking Countries or Communities forGroups of Test Units with Small or Large Differences in Length of Stimuli in the Source Languages

English-Speaking Countries French-Speaking Countriesor Communities or Communities

Word Count in Mean Standard Mean StandardFrench versus English Per Cent Deviation Per Cent Deviation

Correct Correct

French less than 5 per cent longer 59.7 11.3 58.1 13.4than English (10 units)

French between 6 and 10 percent 63.1 13.4 59.1 15.0longer than English (10 units)

French between 11 and 20 per cent 65.0 13.7 62.1 14.1longer than English (15 units)

French more than 20 per cent 68.2 11.8 63.7 12.6longer than English (14 units)

Note: N=490 (10 countries and 49 texts). Tra

nsl

atio

n a

nd

cu

ltu

ral

app

rop

riat

enes

s of

th

e te

st a

nd

su

rvey

mat

eria

l

This pattern suggests that the additionalburden to the reading tasks in countries usingthe longer version does not seem to besubstantial, but the hypothesis of some effect onthe students’ performance cannot be discarded.

Linguistic difficultyReadability indices were computed for the Frenchpassages, using Flesch-Delandsheere(De Landsheere, 1973; Flesch, 1949) and Henry(1975) formulae. For some 26 English texts, read-ability indices were computed using Fry (1977),Dale and Chall (1948) and Spache (1953)formulae.

Most of these formulae use similar indicatorsto quantify the linguistic density of the texts. Themost common indicators, included in both theFrench and English formulae, are: (i) averagelength of words (lexical difficulty); (ii) percentageof low-frequency words (abstractness); and (iii) average length of sentences (syntacticalcomplexity).

Metrics vary across languages, and a numberof other factors—sometimes language-specific—are used in each of the formulae, which preventstrue direct comparison of the indices obtainedfor the English and French versions of a sametext. Therefore, the means in Table 18 belowmust be considered with caution, whereas thecorrelations between the English and Frenchindices are more reliable.

The correlations were all reasonably high orvery high, indicating that the stimuli with higherindices of linguistic difficulty in English tended

to also have higher difficulty indices than otherpassages in French. That is, English texts withmore abstract or more technical vocabulary, orwith longer and more complex sentences, tendedto show the same characteristics when translatedinto French.

Psychometric Quality of the Frenchand English Versions

Using the statistics from the field trial itemanalyses, all items presenting one or more of thefollowing flaws were identified in each nationalversion of the instruments:• a DIF (i.e., the item proved to be significantly

easier (or harder) than in most otherversions);

• fit index was too high (> 1.20); or• discrimination index was too low (< 0.15).

Some 30 items (of the 561 reading, mathem-atics, and science items included in the field trialmaterial) appeared to have problems in all ormost of the 32 participating countries. Since, inthese cases, one can be rather confident that theflaw had to do with the item content rather thanwith the quality of the translation, all relatedobservations were discarded from thecomparisons. A few items with no statistics insome of the countries (items with 0 per cent or100 per cent correct answers) also had to bediscarded. Table 19 shows the distribution offlawed items in the English-speaking and French-speaking countries or communities.

Table 19 clearly shows a very similar pattern

PIS

A 2

00

0 t

ech

nic

al r

epor

t

66

Table 18: Correlation of Linguistic Difficulty Indicators in 26 English and French Texts

English French Correlation

Mean Standard Mean StandardDeviation Deviation

Average Word Length 4.83 char. 0.36 5.09 char. 0.30 0.60

Percentage of Low-Frequency 18.7% 8.6 21.5% 5.4 0.61Words

Average Sentence Length 18 words 6.7 21 words 7.1 0.92

Average Readability Index Dale-Chall: Dale-Chall: Henry: Henry:32.84 9.76 0.496 0.054 0.72

Flesch-DL: Flesch-DL:38.62 4.18 0.83

in the two groups of countries. The percentageof flawed items in the field trial material variedfrom 5.8 to 9.4 per cent in the English-speakingcountries or communities and from5.3 to 8.9 per cent in the French-speaking countriesor communities, and had almost identical means(English: 7.5 per cent, French 7.7 per cent;F=0.05, p>0.83).

The details by item show that only one item(R228Q03) was flawed in all four French-speakingcountries or communities, but in none of theEnglish-speaking countries or communities. Threeother items (R069Q03A, R069Q05 and R247Q01)were flawed in three of four French-speakingcountries or communities, but in none of theEnglish-speaking countries or communities. Con-versely, only four items had flaws in all six English-speaking countries or communities or in four orfive of them, but only in one French-speakingcountry or community (R070Q06, R085Q06,R088Q03, R119Q10). None of the science or math-ematics items showed the same kind of imbalance.

Tra

nsl

atio

n a

nd

cu

ltu

ral

app

rop

riat

enes

s of

th

e te

st a

nd

su

rvey

mat

eria

l

67

Table 19: Percentage of Flawed Items in the English and French National Versions

Number of Too Too Large Low Total Items Easy Hard Fit Discrim- Percentage of

(%) (%) (%) ination Potentially (%) Flawed Items

Australia 532 0.4 0.0 5.6 3.0 7.7

Canada (Eng.) 532 0.0 0.6 3.4 3.6 6.6

Ireland 527 0.0 0.4 5.9 3.6 7.8

New Zealand 532 0.2 0.0 6.8 1.9 7.5

United Kingdom 530 0.0 0.0 8.9 1.5 9.4

United States 532 0.4 0.6 4.1 1.7 5.8

English-Speaking 3 185 0.1 0.3 5.8 2.5 7.5Countries or Communities

Belgium (Fr.) 531 0.0 0.6 5.6 3.8 7.9

Canada (Fr.) 531 0.0 0.2 5.6 5.6 8.9

France 532 0.2 0.9 2.8 2.4 5.3

Switzerland (Fr.) 530 0.0 0.2 3.6 6.2 8.7

French-Speaking 2 124 0.4 0.4 4.4 4.5 7.7Countries or Communities

Psychometric Quality of the National VersionsIn the countries with instruction languages otherthan English and French, the national versions ofthe PISA field trial instruments were developedeither through the recommended procedures, orthrough the following alternative methods.3

• Double translation from English, with cross-checks against French done in Denmark,Finland, Poland and in the German-speakingcountries or communities (Austria, Germany,Switzerland (Ger.)).

• Double translation from English, without cross-checks against French done in the Spanish andPortuguese-speaking countries or communities(Mexico and Spain, Brazil and Portugal).

• Single translation from English or fromFrench done in Greece, Korea, Latvia andthe Russian Federation. Most of the materialwas single-translated from English in thesecountries. In Greece and in Latvia, a small

3 Not all of these were approved.

part of the material was single-translated fromFrench and the remainder from English.

• Mixed methods e.g., Luxembourg hadbilingual booklets, the French material wasadapted from French, and the Germanmaterial was adapted from the ‘common’German version. Japan had reading stimulidouble translated from English and Frenchand the reading items and the mathematicsand science material double translated fromEnglish. Italy and Switzerland (Ital.) singletranslated the material, one from English, theother from French to reconcile the two

versions, but they ran out of time and wereable to reconcile only part of the readingmaterial. Therefore they both kept theremaining units single translated, with somechecks against the version derived from theother source language.In each country, potentially flawed items were

identified using the same procedure as for theEnglish and French-speaking countries orcommunities (explained above). Table 20 showsthe percentage of flawed items observed bymethod and by country.

The results show between-country variation in

PIS

A 2

00

0 t

ech

nic

al r

epor

t

68

Table 20: Percentage of Flawed Items by Translation Method

Method Mean Country Total PercentagePercentage Number of of Potentiallyof Flawed Items Flawed ItemsItems per

Translation Method

Adaptation from Source Version 7.6 Adapted from English 3 185 7.5Adapted from French 2 124 7.7

Double Translation from 8.0 Belgium (Fl.) 532 9.0English and French Hungary 532 7.5

Iceland 532 7.3Netherlands 532 10.3Norway 532 7.0Sweden 532 7.0

Double Translation from English 8.8 Austria 532 5.6(with cross-checks against French) Germany 532 7.9

Switzerland (Ger.) 532 10.5Denmark 532 7.5Finland 532 8.5Poland 532 12.8

Double Translation from English 12.1 Brazil 532 13.9(without use of French) Portugal 532 10.3

Mexico 532 16.0Spain 532 8.3

Single Translation 11.1 Greece 532 9.4Korea 532 16.0Latvia 532 9.2Russian Fed. 532 9.8

Other (Mixed Methods) 10.3 Czech Republic 532 9.4Italy 532 9.2Switzerland (Ital.) 532 13.9Japan 532 13.2Luxembourg 532 5.6

the number of flawed items within each group ofcountries using the same translation method, thisindicates that the method used was not the onlydeterminant of the psychometric qualityachieved in developing the instruments. Otherimportant factors were probably the accuracy ofthe national translators and reconcilers, and thequality of the work done by the internationalverifiers.

Significance tests of the differences betweenmethods are shown in Table 21. These resultsseem to confirm the hypothesis that therecommended procedure (Method B: doubletranslation from English and French) producednational versions that did not differ significantlyin terms of the number of flaws from theversions derived through adaptation from one ofthe source languages. Double translation from asingle language also appeared to be effective, butonly when accompanied by extensive cross-checks against the other source (Method C).

The average number of flaws was higher in allother groups of countries than in those that usedboth sources (either double translating from thetwo languages or using one source for doubletranslation and the other for cross-checks).

Methods E (single translation) and F (doubletranslation from one language, with no cross-checks) proved to be the least trustworthymethods.

Discussion

The analyses discussed in this chapter show thatthe relative linguistic complexity of the readingstimuli in English and French field trial versionsof the PISA assessment materials (as measuredusing readability formulae) was reasonablycomparable. However, the absolute differences inword and character counts between the twoversions were significant, and had a (modest)effect on the difficulty of the items associatedwith those stimuli that were much longer in theFrench version than in the English version.

The average length of words and sentencesdiffers across languages and obviously cannot beentirely controlled by translators when theyadapt test instruments. In this respect, it isprobably impossible to develop ‘true’ equivalentversions of tests involving large amounts ofwritten material. However, longer languagesoften have slightly more redundant T

ran

slat

ion

an

d c

ult

ura

l ap

pro

pri

aten

ess

of t

he

test

an

d s

urv

ey m

ater

ial

69

Table 21: Differences Between Translation Methods

B. C. D. E. F.Double Double Other Single Double

Translation Translation Methods Translation TranslationEnglish and Checks with No Checks

and French

A. A>B A>C A>D A>E A>FAdapted from Sources F=0.45 F=1.73 F=5.23 F=4.24 F=13.79

p=0.51 p=0.21 p=0.04 p=0.06 p=0.02

B. B>C B>D B>E B>FDouble Translation F= 0.45 F=2.27 F=4.37 F=7.12English and French p=0.52 p=0.17 p=0.07 p=0.03

C. C>D C>E C>FDouble Translation F=0.69 F=1.58 F=3.14and Checks p=0.43 p=0.24 p=0.11

D. D>E D>FOther Methods F=0.14 F=0.66

p=0.72 p=0.44

E. E>FSingle Translation F=0.19

p=0.68

morphological or syntactical characteristics,which may help compensate part of theadditional burden on reading tasks, especially intest situations like PISA, with no strongrequirements for speed.

No significant differences were observedbetween the two source versions in the overallnumber and distribution of flawed items.

With a few exceptions, the number oftranslation errors in the field trial nationalversions translated from the source materialsremained acceptable. Only nine participants hadmore than 10 per cent of flawed items comparedto the average of 7.6 per cent observed incountries that used one source version with onlysmall national adaptations.

The data seem to support the hypothesis thatdouble translation from both (rather than one)source versions resulted in better translations,with a lower incidence of flaws. Doubletranslation from one source with extensive cross-checks against the other source was an effectivealternative procedure.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

70

Field Operations

Nancy Caldwell and Jan Lokan

Overview of Roles andResponsibilities

The study was implemented in each country by aNational Project Manager (NPM) whoimplemented the procedures prepared by theConsortium. To implement the assessment inschools the NPMs were assisted by School Co-ordinators and Test Administrators. Each NPMtypically had several assistants, working from abase location that is referred to throughout thisreport as a ‘national centre’.

National Project Managers

National Project Managers (NPMs) wereresponsible for implementing the project withintheir own country. They selected the schoolsample, ensured that schools co-operated andthen selected the student sample from a list ofeligible students provided by each school (theStudent Listing Form). NPMs could followstrictly defined international procedures forselecting the school sample and then submitcomplete reports on the process to theConsortium, or provide complete lists of schoolswith age-eligible students and have theConsortium carry out the sampling.

NPMs were also given a choice of twomethods for selecting the student sample (seeChapter 4). These two methods were: (i) usingthe PISA student sampling software prepared bythe Consortium and (ii) selecting a random startnumber and then calculating the samplinginterval following the instructions in theSampling Manual. A complete list was prepared

for each school using the Student Tracking Formthat served as the central administrationdocument for the study and linked students, testbooklets and student questionnaires.

NPMs had additional operationalresponsibilities both before and after theassessment. These included:• hiring and training Test Administrators to

administer tests in individual schools;• scheduling test sessions and deploying

external Test Administrators (therecommended model);

• translating and adapting test instruments,manuals, and other materials into the testinglanguage(s);

• submitting translated documents to theConsortium for review and approval;

• sending instructions for preparing lists ofeligible students to the selected schools;

• assembling test booklets according to thedesign and layout specified by the Consortium;

• overseeing printing of test booklets andquestionnaires;

• maintaining ties with the School QualityMonitors appointed by the Consortium(monitoring activities are described in Chapter 7);

• co-ordinating the activities of TestAdministrators and School Quality Monitors;

• overseeing the packing and shipping of allmaterials;

• overseeing the receipt of test and othermaterials from the schools, and marking anddata entry;

• sending completed materials to theConsortium; and

Chapter 6

• preparing the NPM report of activities andsubmitting it to the Consortium.

School Co-ordinators

School Co-ordinators (SCs) co-ordinated school-related activities with the national centre and theTest Administrators. Their first task was toprepare the Student Listing Form with the namesof all eligible students in the school and to sendit to the NPM so that the NPM could select thestudent sample.

Prior to the test, the SC was to:• establish the testing date and time in

consultation with the NPM;• receive the list of sampled students on the

Student Tracking Form from the NPM andupdate it if necessary, plus identify studentswith disabilities or limited test languageproficiency who could not take the testaccording to criteria established by theConsortium;

• receive, distribute and collect the SchoolQuestionnaire and then deliver it to the TestAdministrator;

• inform school staff, students and parents ofthe nature of the test and the test date, andsecure parental permission if required by theschool or education system;

• inform the NPM and Test Administrator ofany test date or time changes; and

• assist the Test Administrator with roomarrangements for the test day.On the test day, the SC was expected to

ensure that the sampled students attended thetest session(s). If necessary, the SC also madearrangements for a make-up session and ensuredthat absent students attended the make-upsession.

Test Administrators

The Test Administrators (TAs) were primarilyresponsible for administering the PISA test fairly,impartially and uniformly, in accordance withinternational standards and PISA procedures. Tomaintain fairness, a TA could not be the reading,mathematics or science teacher of the studentsbeing assessed and it was preferred that they notbe a staff member at any participating school.

Prior to the test date, TAs were trained bynational centres. Training included a thorough

review of the Test Administrator Manual (seenext section) and the script to be followedduring the administration of the test andquestionnaire. Additional responsibilitiesincluded:• ensuring receipt of the testing materials from

the NPM and maintaining their security;• co-operating fully with the SC;• contacting the SC one to two weeks prior to

the test to confirm plans;• completing final arrangements on the test day;• conducting a make-up session, if needed, in

consultation with the SC;• completing the Student Tracking Form and

the Assessment Session Report Form (a formdesigned to summarise session times, studentattendance, any disturbance to the session,etc.);

• ensuring that the number of tests andquestionnaires collected from students talliedwith the number sent to the school;

• obtaining the School Questionnaire from theSC; and

• sending the School Questionnaire and all testmaterials (both completed and not completed)to the NPM after the testing was carried out.

Documentation

NPMs were given comprehensive proceduralmanuals for each major component of theassessment.• The National Project Manager’s Manual

provided detailed information about dutiesand responsibilities, including generalinformation about PISA; field operations androles and responsibilities of the NPM, the SCand the TA; translating the manuals and testinstruments; selecting the student sample;assembling and shipping materials; datamarking and entry; and documentation to besubmitted by the NPM to the Consortium.

• The School Coodinator’s Manual described, indetail, the activities and responsibilities of theSC. Information was provided regarding allaspects, from selecting a date for theassessment to arranging for a make-up sessionand storing the copies of the assessmentforms.

• The Test Administrator’s Manual not onlyprovided a comprehensive description of theduties and responsibilities of the TA, from

PIS

A 2

00

0 t

ech

nic

al r

epor

t

72

attending the TA training to conducting amake-up session, but also included the scriptto be read during the test as well as a ReturnShipment Form (a form designed to trackmaterials to and from the school).

• The Sampling Manual provided detailedinstructions regarding the selection of theschool and student samples and the reportsthat had to be submitted to the Consortium todocument each step in the process of selectingthese samples.These manuals also included checklists and

timetables for easy reference. NPMs wererequired to fill in country specific information inaddition to adding specific dates in both theSchool Co-ordinator’s Manual and the TestAdministrator’s Manual.

KeyQuest®, a sampling software package,was prepared by the Consortium and given toNPMs as one method for selecting the studentsample from the Student Listing Form. Althoughabout half did, in fact, use this software, someNPMs used their own sampling software (whichhad to be approved by the Consortium).

Materials Preparation

Assembling Test Booklets andQuestionnaires

As described in Chapter 2, nine different testbooklets had to be assembled with clusters oftest items arranged according to the balancedincomplete block design specified by theConsortium. Test items were presented in units(stimulus material and two or more itemsrelating to the stimulus) and each clustercontained several units. Test units andquestionnaire items were initially sent to NPMsseveral months before the testing dates, to enabletranslation to begin. Units allocated to clustersand clusters allocated to booklets were provideda few weeks later, together with detailedinstructions to NPMs about how to assembletheir translated or adapted clusters into booklets.

For reference, master hard copies of allbooklets were provided to NPMs and mastercopies in both English and French were alsoavailable through a secure website. NPMs wereencouraged to use the cover design provided bythe OECD (both black and white and colouredversions of the cover design were made

available). In formatting translated or adaptedtest booklets, they had to follow as far aspossible the layout in the English master copies,including allocation of items to pages. A slightlysmaller or larger font than in the master copywas permitted if it was necessary to ensure thesame page set-up as that of the source version.

NPMs were required to submit copies of theirassembled test booklets for verification by theConsortium before printing the booklets.

The Student Questionnaire contained one,two, or three modules, according to whether theinternational options of Cross-CurricularCompetency (CCC) and Computer Familiarityor Information Technology (IT) were beingadded to the core component. About half thecountries chose to administer the IT componentand just over three-quarters used the CCC com-ponent. The core component had to be presentedfirst in the questionnaire booklet. If bothinternational options were used, the IT modulewas to be placed ahead of the CCC module.

NPMs were permitted to add questions ofnational interest as ‘national options’ to thequestionnaires. Proposals and text for these hadto be submitted to the Consortium for approval.It was recommended that, if more than twopages of additional questions were proposed andthe tests and questionnaires were beingadministered in a single session, the additionalmaterial should be placed at the end of thequestionnaire. The Student Questionnaire wasmodified more often than the SchoolQuestionnaire.

NPMs were required to submit copies of theirassembled questionnaires for verification by theConsortium prior to printing.

Printing Test Booklets andQuestionnaires

Printing had to be done such that the content ofthe instruments was constantly secure. Giventhat the test booklet was administered in twoparts to give students a brief rest after the firsthour, the second half of the booklet had to bedistinguishable from the first half so that TAscould see if a student was working in the wrongpart. It was recommended that this be doneeither by having a shaded strip printed across thetop (or down the outside edge) of pages in thesecond half of the booklet, or by having the

Fie

ld o

per

atio

ns

73

second half of the booklet sealed with a sticker.If the Student Questionnaire was to be

included in the same booklet as the test items,then the questionnaire section had to beindicated in a different way from the two partsof the test booklet. Although this complicatedthe distribution of materials, it was simpler toprint the questionnaire in a separate booklet andmost countries did so.

Packaging and Shipping Materials

Regardless of how materials were packaged andshipped, the following needed to be sent eitherto the TA or to the school:• test booklets and Student Questionnaires for

the number of students sampled;• Student Tracking Form;• two copies of the Assessment Session Report

Form;• Packing Form;• Return Shipment Form;• additional materials, e.g., rulers and

calculators, as per local circumstances; and• additional School and Student Questionnaires

and a bundle of extra test booklets.Of the nine separate test booklets, one could

be pre-allocated to each student by theKeyQuest® software from a random startingpoint in each school. KeyQuest® could then beused to generate the school’s Student TrackingForm, which contained the number of theallocated booklet alongside each sampledstudent’s name. Instructions were also providedfor carrying out this step manually, starting withBooklet 1 for the first student in the first school,and assigning booklets in order up to the 35th

student for the designated number of students inthe sample. With 35 students (the recommendednumber of students to sample per school), thelast student in the first school would be assignedBooklet 8. At the second school, the first studentwould be assigned Booklet 9, the second,Booklet 1, and so on. In this way, the bookletswould be rotated so that each would be usedmore or less equally, as they also would be ifallocated from a random starting point.

It was recommended that labels be printed,each with a student identification number andtest booklet number allocated to thatidentification, plus the student’s name if this wasan acceptable procedure within the country. Two

or three copies of each student’s label could beprinted, and used to identify the test booklet, thequestionnaire, and a packing envelope if used.

NPMs were allowed some flexibility in howthe materials were packaged and distributed,depending on national circumstances. It wasspecified however that the test booklets for aschool be packaged so that they remainedsecure, possibly by wrapping them in clearplastic and then heat sealing the package, or bysealing each booklet in a labelled envelope.

Three scenarios, summarised here, were des-cribed as illustrative of acceptable approaches topackaging and shipping the assessmentmaterials:• Country A − all assessment materials shipped

directly to the schools (with fax-back formsprovided for SCs to acknowledge receipt ofmaterials); school staff (not teachers of thestudents in the assessment) used to conductthe testing sessions; test booklets and StudentQuestionnaires printed separately; materialsassigned to students before packaging forshipment; each test booklet andquestionnaire labelled, and then sealed in anidentically labelled envelope for shipping tothe school.

• Country B − all assessment materials shippeddirectly to the schools (with fax-back formsprovided for SCs to acknowledge receipt ofmaterials); test sessions conducted by TAsemployed by the national centre; test bookletsand Student Questionnaires printed separately,and labelled and packaged in separatelybound bundles assembled in Student TrackingForm order; on completion of the assessment,each student places and seals his/her testbooklet and questionnaire in a labelledenvelope provided to protect studentconfidentiality within the school.

• Country C − test sessions conducted by TAsemployed by the national centre, withassessment materials for the scheduled schoolsshipped directly to the TAs; test and StudentQuestionnaires printed in the one booklet,with a black bar across the top of each pagein the second part of the test; bundles of 35booklets loosely sealed in plastic, so that theirnumber can be checked without opening thepackages; school packages openedimmediately prior to the session by the TAand identification number and name labels

PIS

A 2

00

0 t

ech

nic

al r

epor

t

74

affixed according to the assignment ofbooklets pre-recorded on the Student TrackingForm at the national centre.

Receipt of Materials at the NationalCentre After Testing

It was recommended that the national centreestablish a database of schools before testingbegan with fields for recording shipment ofmaterials to and from schools, tallies ofmaterials sent and tallies of completed andunused booklets returned, and for various stepsin processing booklets after the testing.KeyQuest® could be used for this purpose ifdesired. TAs recorded the student’s participationstatus (present/absent) on the Student TrackingForm for each of the two test parts and for theStudent Questionnaire. This information was tobe checked at the national centre during theunpacking step.

Processing Tests andQuestionnaires AfterTesting

This section describes PISA’s markingprocedures, including multiple marking, and alsomakes brief reference to pre-coding of responsesto a few items in the Student Questionnaire.Because about one-third of the mathematics andscience items and more than 40 per cent of thereading items required students to write in theiranswers, the answers had to be evaluated andmarked (scored). This was a complex operation,as booklets had to be randomly assigned tomarkers and, for the minimum recommendedsample size per country of 4 500 students, morethan 120 000 responses had to be evaluated.Each of the nine booklets had between 20 and31 items requiring marking.

It is crucial for comparability of results in astudy such as PISA that students’ responses bescored uniformly from marker to marker andfrom country to country. Comprehensive criteriafor marking, including many examples ofacceptable and not acceptable responses, wereprepared by the Consortium and provided toNPMs in Marking Guides for each of reading,mathematics and science.

Steps in the Marking Process

In setting up the marking of students’ responsesto open-ended items, NPMs had to carry out oroversee several steps:• adapt or translate the Marking Guides as

needed;• recruit and train markers;• locate suitable local examples of responses to

use in training and practice;• organise booklets as they were returned from

schools;• select booklets for multiple marking;• single mark booklets according to the

international design;• multiple mark a selected sub-sample of

booklets once the single marking wascompleted;

• submit a sub-sample of booklets for the Inter-Country Rater Reliability Study (seeChapter 10).Detailed instructions for each step were

provided in the NPM Manual. Key aspects of theprocess are included here.

International trainingRepresentatives from each national centre wererequired to attend two international markertraining sessions—one immediately prior to thefield trial and one immediately prior to the mainstudy. At the training sessions, Consortium stafffamiliarised national centre staff with themarking guides and their interpretation.

StaffingNPMs were responsible for recruitingappropriately qualified people to carry out thesingle and multiple marking of the test booklets.In some countries, pools of experienced markersfrom other projects could be called on. It wasnot necessary for markers to have high-levelacademic qualifications, but they needed to havea good understanding of either mid-secondarylevel mathematics and science or the language ofthe test, and to be familiar with ways in whichsecondary-level students express themselves.Teachers on leave, recently retired teachers andsenior teacher trainees were all considered to bepotentially suitable markers. An importantfactor in recruiting markers was that they couldcommit their time to the project for the durationof the marking, which was expected to take upto two months.

Fie

ld o

per

atio

ns

75

The Consortium provided a MarkerRecruitment Kit to assist NPMs in screeningapplicants. These materials were similar innature to the Marking Guides, but were muchbriefer. They were designed so that applicantswho were considered to be potentially suitablecould be given a brief training session, afterwhich they marked some student responses.Guidelines for assessing the results of thisexercise were supplied. The materials alsoprovided applicants with the opportunity toassess their own suitability for the task.

The number of markers required was governedby the design for multiple marking (described ina later section). For the main study, markers wererequired in multiples of eight, with 16 as the recom-mended number for reading, and eight—whocould mark both mathematics and science—recommended for those areas. These numbers ofmarkers were considered to be adequate forcountries testing between 4 500 (the minimumnumber required) and 6 000 students to meet thetimeline of submitting their data within threemonths of testing.

For larger numbers of students or in caseswhere reading markers could not be obtained inmultiples of eight, NPMs could prepare theirown design for reading marking and submit it tothe Consortium for approval only a handful ofcountries did this. For NPMs who were unableto recruit people who could mark bothmathematics and science, an alternative designinvolving four mathematics markers and fourscience markers was provided (but this designhad the drawback that it yielded lesscomprehensive information from the multiplemarking). Given that several weeks wererequired to complete the marking, it wasrecommended that at least two back-up readingmarkers and one back-up mathematics/sciencemarker be trained and included in at least someof the marking sessions.

The marking process was complex enough torequire a full-time overall supervisor of activitieswho was familiar with logistical aspects of themarking design, the procedures for checkingmarker reliability, the marking schedules andalso the content of the tests and MarkingGuides.

NPMs were also required to designate personswith subject-matter expertise, familiarity withthe PISA tests and, if possible, experience in

marking student responses to open-ended items,in order to act as ‘table leaders’ during themarking. Good table leaders were essential tothe quality of the marking, as their main rolewas to monitor markers’ consistency in applyingthe marking criteria. They also assisted with theflow of booklets, and fielded and resolvedqueries about the Marking Guide and aboutparticular student responses in relation to theguide, consulting the supervisor as necessarywhen queries could not be resolved. Thesupervisor was then responsible for checkingsuch queries with the Consortium.

Depending on their level of experience and onthe design followed for mathematics/science,between two and four table leaders wererecommended for reading and either one or twofor mathematics/science. Table leaders wereexpected to participate in the actual markingand spend extra time monitoring consistency.

Several persons were needed to unpack,check, and assemble booklets into labelledbundles so that markers could respect thespecified design for randomly allocating sets ofbooklets to markers.

Confidentiality formsBefore seeing or receiving any copies of PISA testmaterials, prospective markers were required tosign a Confidentiality Form, obligating them notto disclose the content of the PISA tests beyondthe groups of markers and trainers with whomthey would be working.

National trainingAnyone who marked the PISA main survey testbooklets had to participate in specific trainingsessions, regardless of whether they had hadrelated experience or had been involved in thePISA field trial marking. To assist NPMs incarrying out the training, the Consortium preparedtraining materials in addition to the detailedMarking Guides. Training within a country couldbe carried out by the NPM or by one or moreknowledgeable persons appointed by the NPM.Subject matter knowledge was important for thetrainer as was understanding the procedures,which usually meant that more than one personwas involved in leading the training.

Training sessions were organised as a functionof how marking was done. The recommendationwas to mark by cluster, completing the marking

PIS

A 2

00

0 t

ech

nic

al r

epor

t

76

of each item separately within a cluster beforemoving to the next item, and completing onecluster before moving to the next. Therecommended allocation of booklets to markersassumed marking by cluster, though analternative design involving marking by bookletwas also provided.

If marking was done by cluster, then markerswere trained by cluster for the nine readingclusters and four clusters for each of mathematicsand science. If prospective markers had done asrecommended and worked through the testbooklets and read through the Marking Guides inadvance, training for each cluster took about halfa day. During a training session, the trainerreviewed the Marking Guide for a cluster of unitswith the markers, then had the markers assignmarks to some sample items for which theappropriate marks had been supplied by theConsortium. The trainer reviewed the results withthe group, allowing time for discussion, queryingand clarification of reasons for the pre-assignedmarks. Trainees then proceeded to markindependently some local examples that had beencarefully selected by the supervisor of marking inconjunction with national centre staff.

Training was more difficult if marking wasdone by booklet. Each booklet contained eithertwo or three 30-minute reading clusters andfrom two to four 15-minute mathematics/scienceclusters (except Booklet 7, which had fourreading clusters). Thus, markers had to betrained in several clusters before they couldbegin marking a booklet.

It was recommended that prospective markersbe informed at the beginning of training thatthey would be expected to apply the MarkingGuides with a high level of consistency, and thatreliability checks would be made frequently bytable leaders and the overall supervisor as partof the marking process.

Ideally, table leaders were trained before thelarger groups of markers since they needed to bethoroughly familiar with both the test items andthe Marking Guides. The marking supervisorexplained these to the point where the tableleaders could mark and reach a consensus on theselected local examples to be used later with thelarger group of trainees. They also participatedin the training sessions with the rest of themarkers, partly to strengthen their ownknowledge of the Marking Guides and partly to

assist the supervisor in discussions with the traineesof their pre-agreed marks to the sample items.

Table leaders received additional training inthe procedures for monitoring the consistencywith which markers applied the criteria.

Length of marking sessionsMarking responses to open-ended items ismentally demanding, requiring a level ofconcentration that cannot be maintained forlong periods of time. It was thereforerecommended that markers work for no morethan six hours per day on actual marking, andtake two or three breaks for coffee and lunch.Table leaders needed to work longer on mostdays so that they had adequate time for theirmonitoring activities.

Logistics Prior to Marking

Sorting booklets

When booklets arrived back at the nationalcentre, they were first tallied and checkedagainst the session participation codes on theStudent Tracking Form. Unused and usedbooklets were separated; used booklets weresorted by student identification number if theyhad not been sent back in that order and thenwere separated by booklet number; and schoolbundles were kept in school identification order,filling in sequence gaps as packages arrived.Student Tracking Forms were carefully filed inring binders in school identification order. If theschool identification number order did notcorrespond with the alphabetical order of schoolnames, it was recommended that an index ofschool name against school identification beprepared and kept with the binders.

Because of the time frame within whichcountries had to have all their marking done anddata submitted to the Consortium, it was usuallyimpossible to wait for all materials to reach thenational centre before beginning to mark. Inorder to manage the design for allocatingbooklets to markers, however, it wasrecommended to start marking only when atleast half of the booklets had been returned.

Selection of booklets for multiple markingThe Technical Advisory Group decided to setaside in each country 48 each of Booklets 1 to 7and 72 each of Booklets 8 and 9 for multiple

Fie

ld o

per

atio

ns

77

marking. The mathematics and science clusterswere more thinly spread in Booklets 1 to 7 thanin Booklets 8 and 9. The larger number ofBooklets 8 and 9 was specified to have sufficientdata on which to base the marker reliabilityanalyses for mathematics and science.

The main principle in setting aside thebooklets for multiple marking was that theselection needed to ensure a wide spread ofschools and students across the whole sampleand to be random as far as possible. Thesimplest method for carrying out the selectionwas to use a ratio approach based on theexpected total number of completed booklets,combined with a random starting point.

In the minimum recommended student sampleof 4 500 students per country, approximatelyevery tenth booklet of Booklets 1 to 7 and everyseventh booklet of Booklets 8 and 9 needed to beset aside. Random numbers between 1 and 10and between 1 and 7 needed to be drawn as thestarting points for selection. Depending on theactual numbers of completed booklets received,the selection ratios needed to be adjusted so thatthe correct numbers of each booklet were selectedfrom the full range of participating schools.

Booklets for single markingOnly one marker was required to mark all bookletsremaining after those for multiple marking hadbeen set aside. For the minimum required samplesize of 4 500, there would have been approximately450 of each of Booklets 1 to 7 and 430 of eachof Booklets 8 and 9 in this category.

Some items requiring marking did not need tobe included in the multiple marking. The lastmarker in the multiple-marking process markeditems in the booklets set aside for multiplemarking.

How Marks Were Shown

A string of small code numbers corresponding tothe possible codes for the item as delineated inthe relevant Marking Guide (including the codefor ‘not applicable’ to allow for a misprinteditem) appeared in the upper right-hand side ofeach item in the test booklets requiringjudgement. For booklets being processed by asingle marker, the mark assigned was indicateddirectly in the booklet by circling theappropriate code number alongside the item.

Tailored marking record sheets were preparedfor each booklet for the multiple marking andused by all but the last marker so that eachmarker undertaking multiple marking did notknow which marks other markers had assigned.

For the reading tests, item codes were oftenjust 0, 1, 9 and ‘n’, indicating incorrect, correct,missing and ‘not applicable’, respectively.Provision was made for some of the open-endeditems to be marked as partially correct, usuallywith ‘2’ as fully correct and ‘1’ as partiallycorrect, but occasionally with three degrees ofcorrectness indicated by codes of ‘1’, ‘2’ and ‘3’.

For the mathematics and science tests, a two-digit coding scheme was adopted for the itemsrequiring constructed responses. The first digitrepresented the ‘degree of correctness’ mark, asin reading; the second indicated the content ofthe response or the type of solution method usedby the student. Two-digit codes were originallyproposed by Norway for the Third InternationalMathematics and Science Study (TIMSS) andwere adopted in PISA because of their potential foruse in studies of student learning and thinking.

Marker IDentification Numbers

Marker identification numbers were assignedaccording to a standard three-digit formatspecified by the Consortium. The first digit hadto show if the marker was a reading, mathem-atics or science marker (or mathematics/science),and the second and third digits had to uniquelyidentify the markers within their set. Markeridentification numbers were used for twopurposes: implementing the design for allocatingbooklets to markers; and in monitoring markerconsistency in the multiple-marking exercises.

Design for Allocating Booklets to Markers

Reading

If marking was done by cluster, each readingmarker needed to handle three of the ninebooklet types at a time because the clusters werefeatured in three booklets, except for the ninthcluster, which appeared in only two of thebooklets. For example, reading Cluster 1occurred in Booklets 1, 5 and 7, which thereforehad to be marked before any other items inanother cluster. Moreover, since marking was

PIS

A 2

00

0 t

ech

nic

al r

epor

t

78

done item by item, the item was marked acrossthree booklets before the next item was marked.

A design to ensure the random allocation ofbooklets to reading markers was prepared basedon the recommended number of 16 markers andthe minimum sample size of 4 500 students from150 schools.1 Booklets were to be sorted by studentindentification within schools. With 150 schoolsand 16 markers, each marker had to mark a clusterwithin a booklet from eight or nine schools (150 ÷ 16 ≅ 9). Figure 6 shows how bookletsneeded to be assigned to markers for the readingsingle marking. Further explanation of theinformation in this table is presented below.

According to this design, Cluster 1 in subset 1(schools 1 to 9) was to be marked by ReadingMarker 1 (M1 in Figure 6), Cluster 1 in subset 2(schools 10 to 18) was to be marked by ReadingMarker 2 (M2 in Figure 6), and so on. ForCluster 2, Reading Marker 1 was to mark allfrom subset 2 (schools 10 to 18) and ReadingMarker 2 was to mark all from subset 3 (schools19 to 27). Subset 1 of Cluster 2 (schools 1 to 9)was to be marked by Reading Marker 16.

If booklets from all participating schools wereavailable before the marking began, the followingsteps would be involved in implementing the design:

Fie

ld o

per

atio

ns

79

Subsets of Schools

Cl Bklt 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

R1 1,5,7 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 M16








R9 8,9 M9 M10 M11 M12 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8

Figure 6: Allocation of the Booklets for Single Marking of Reading by Cluster

Step 1: Set aside booklets for multiple markingand then divide the remaining booklets intoschool subsets as above; (subset 1: schools 1 to 9;subset 2: schools 10 to 18, etc., to achieve 16subsets of schools).

Step 2: Assuming that marking begins withCluster 1:Marker 1 takes Booklets 1, 5 and 7 for SchoolSubset 1;Marker 2 takes Booklets 1, 5 and 7 for SchoolSubset 2; etc.; untilMarker 16 takes Booklets 1, 5 and 7 forSchool Subset 16.

Step 3: Markers mark all of the first Cluster 1item requiring marking in the booklets thatthey have 1, 5 and 7.

Step 4: The second Cluster 1 item is marked inall three booklet types, followed by the thirdCluster 1 item, etc., until all Cluster 1 itemsare marked.

Step 5: For Cluster 2, as per the row of thetable in Figure 6 corresponding to R2 in theleft-most column, each marker is allocated asubset of schools different from their subsetfor Cluster 1. Marking proceeds item by itemwithin the cluster.

Step 6: For the remaining clusters, the rowscorresponding to R3, R4, etc., in the table arefollowed in succession.

1 Countries with more or fewer than 150 schools or adifferent number of markers had to adjust the size ofthe school subsets accordingly.

As a result of this procedure, the booklets fromeach subset of schools are processed by ninedifferent reading markers and each student’sbooklet is marked by three different readingmarkers (except Booklet 7, which has four readingclusters, and Booklets 8 and 9, which have onlytwo reading clusters). Spreading booklets amongmarkers in this way minimises the effects of anysystematic leniency or harshness in marking.

In practice, most countries would not have hadcompleted test booklets back from all their sampledschools before marking needed to begin. NPMs wereencouraged to organise the marking in two waves,so that it could begin after materials were receivedback from one-half of their schools. Schools wouldnot have been able to be assigned to school sets formarking exactly in their school identificationorder, but rather by identification order combinedwith when their materials were received andprocessed at the national centre.

Mathematics and scienceWith eight mathematics/science markers whocould mark in both areas, the booklets needed to

be assigned to markers according to Figure 7 formarking by cluster. Here, the subsets of schoolswould contain twice the number of schools asfor the reading marking, because schools wereshared among eight and not 16 markers. Thiscould be done by combining subsets 1 and 2from reading into subset 1 for mathematics/science, and so on unless the mathematics/science marking was done before the reading, inwhich case the reading subsets could be formedby halving the mathematics/science subsets.(Logistics were complex because all of thebooklets that contained mathematics or scienceitems also contained reading items, and had tobe shared with the reading markers.)

It was recommended that the marking in onestage be completed before marking in the otherbegan. Stage 2 could be undertaken before Stage 1, if this helped with the flow of booklets.

If separate markers were needed formathematics and science, the single markingcould be accomplished with the design shown inFigure 8, in which case schools needed to bedivided into only four subsets.

80

Subsets of SchoolsStage Cluster Booklet 1 2 3 4 5 6 7 8

1: Mathematics M1 1,9 M1 M2 M3 M4 M5 M6 M7 M8

M2 1,5,8 M7 M8 M1 M2 M3 M4 M5 M6

M3 3,5,9 M5 M6 M7 M8 M1 M2 M3 M4

M4 3,8 M3 M4 M5 M6 M7 M8 M1 M2

2: Science S1 2,8 M1 M2 M3 M4 M5 M6 M7 M8

S2 2,6,9 M7 M8 M1 M2 M3 M4 M5 M6

S3 4,6,8 M5 M6 M7 M8 M1 M2 M3 M4

S4 4,9 M3 M4 M5 M6 M7 M8 M1 M2

Figure 7: Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Common Markers

Mathematics Science Subsets of Schools

Cluster Booklet Cluster Booklet 1 2 3 4

M1 1,9 S1 2,8 M1 M2 M3 M4

M2 1,5,8 S2 2,6,9 M2 M3 M4 M1

M3 3,5,9 S3 4,6,8 M3 M4 M1 M2

M4 3,8 S4 4,9 M4 M1 M2 M3

Figure 8: Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Separate Markers

PIS

A 2

00

0 t

ech

nic

al r

epor

t

SE booklet (Booklet 0)Countries using the shorter, special purpose SEbooklet (numbered as Booklet 0) were advised toprocess this separately from the remainingbooklets. Small numbers of students used thisbooklet, only a few items required marking, andthey were not arranged in clusters. NPMs werecautioned that booklets needed to be allocatedto several markers to ensure uniform applicationof marking criteria for the SE booklet, as for themain marking.

Managing the Actual Marking

Booklet flow

To facilitate the flow of booklets, it was importantto have ample table surfaces on which to placeand arrange them by type and school subset. Thebundles needed to be clearly labelled. For thispurpose, it was recommended that each bundleof booklets be identified by a batch header foreach booklet type (Booklets 1 to 9), with spacesfor the number of booklets and school identific-ations represented in the bundle to be written in.In addition, each header sheet was to be pre-printed with a list of the clusters in the booklet,with columns alongside which the date and time,marker’s name and identification and tableleader’s initials could be entered as the bundlewas marked and checked.

Separating the reading and mathematics/science marking

It was recommended that reading and mathema-tics/science marking be done at least partly atdifferent times (for example, mathematics/sciencemarking could start a week or two ahead of thereading marking). Eight of the nine bookletscontained material in reading together withmaterial in mathematics and/or science. Exceptfor Booklet 7, which contained only reading, itcould have been difficult to maintain an efficientflow of booklets through the marking process ifall marking in all three domains were donesimultaneously.

Familiarising markers with the markingdesign

The relevant design for allocating booklets tomarkers was explained either during the markertraining session or at the beginning of the firstmarking session (or both). The marking

supervisor was responsible for ensuring thatmarkers adhered to the design, and used clericalassistants if needed. Markers could betterunderstand the process if each was providedwith a card indicating the bundles of booklets tobe taken and in which order.

Consulting table leadersDuring the initial training, practice, and review,it was expected that coding issues would bediscussed openly until markers understood therationale for the marking criteria (or reachedconsensus where the Marking Guide wasincomplete). Markers were advised to workquietly, referring queries to their table leaderrather than to their neighbours. If a particularquery arose often, the table leader was advisedto discuss it with the rest of the group.

Markers were not permitted to consult othermarkers or table leaders during the additionalpractice exercises (see next subsection) to gaugewhether all or some markers needed more trainingand practice, or during the multiple marking.

Monitoring single markingThe steps described here represent the minimumlevel of monitoring activities required. Countrieswishing to implement more extensive monitoringprocedures during single marking wereencouraged to do so.

The supervisor, assisted by table leaders, wasadvised to collect markers’ practice papers aftereach cluster practice session and to tabulate themarks assigned. These were then to be comparedwith the pre-agreed marks: each matching markwas considered a hit and each discrepant markwas considered a miss. To reflect an adequatestandard of reliability, the ratio of hits to thetotal of hits plus misses needed to be 0.85 ormore. In mathematics and science, this reliabilitywas to be assessed on the first digit of the two-digit codes. A ratio of less than 0.85, especiallyif lower than 0.80, was to be taken as indicatingthat more practice was needed, and possibly alsomore training.

Table leaders played a key role during eachmarking session and at the end of each day, byspot-checking a sample of booklets or items thathad already been marked to identify problemsfor discussion with individual markers or withthe wider group, as appropriate. All bookletsthat had not been set aside for multiple marking

Fie

ld o

per

atio

ns

81

were candidates for this spot-checking. It wasrecommended that, if there were indications fromthe practice sessions that one or more particularmarkers might be experiencing problems in usingthe Marking Guide consistently, then more of thosemarkers’ booklets should be included in the checking.

Table leaders were advised to review theresults of the spot-checking with the markers atthe beginning of the next day’s marking. Thiswas regarded primarily as a mentoring activity,but NPMs were advised to keep in contact withtable leaders and the marking supervisor if therewere individual markers who did not meetcriteria of adequate reliability and would need tobe removed from the pool.

Table leaders were to initial and date theheader sheet of each batch of booklets for whichthey had carried out spot-checking. Someitems/booklets from each batch and each markerhad to be checked.

Multiple Marking

For PISA 2000, multiple marking meant thatfour separate markers marked all short responseand open constructed-response items in 48 ofeach of Booklets 1 to 7 and 72 of each ofBooklets 8 and 9. Multiple marking was done ator towards the end of the marking period, aftermarkers had familiarised themselves with andwere confident in using the Marking Guides.

As noted earlier, the first three markers of theselected booklets circled codes on separaterecord sheets, tailored to booklet type andsubject area (reading, mathematics or science),using one page per student. The markingsupervisor checked that markers correctlyentered student identification numbers and theirown identification number on the sheets, whichwas crucial to data quality. Also as noted earlier,the SE booklet was not included in the multiplemarking.

In a country where booklets were provided inmore than one language, multiple marking wasrequired only in the language used for themajority of the PISA booklets, thoughencouraged in more than one language ifpossible. If two languages were used equally, theNPM could choose either language.

While markers would have been thoroughlyfamiliar with the Marking Guides by this time,they may have most recently marked a different

PIS

A 2

00

0 t

ech

nic

al r

epor

t

82

booklet from those allocated to them for multiplemarking. For this reason, they needed to havetime to re-read the relevant Marking Guide beforebeginning the marking. It was recommended thatat least half a day be used for markers to refreshtheir familiarity with the guides and to look againat the additional practice material beforeproceeding with the multiple marking.

As in the single marking, marking was to bedone item by item. For manageability, all itemswithin a booklet were to be marked beforemoving to the next booklet. Rather than markingby cluster across several booklet types, it wasconsidered that markers would be experiencedenough in applying the marking criteria by thistime that marking by booklet would be unlikelyto detract from the quality of the data.

Booklet allocation design The specified multiple marking design forreading, shown in Figure 9, assumed 16 markerswith identification numbers 201 to 216. Theimportance of following the design exactly asspecified was stressed, as it provides forbalanced links between clusters and markers.

Figure 8 shows 16 markers grouped into eightpairs of two, with Group 1 comprising the firsttwo markers (201 and 202), Group 2 the nexttwo (203 and 204), etc. The design involved twosteps, with the booklets divided into two sets.Booklets 1 to 4 made up one set, and Booklets 5to 9 the second set. The four markings were tobe carried out by allocating booklets to the fourmarkers shown in the right-hand column of thetable for each booklet.

Step Booklet Marker Marker IdentificationGroups

1 1 and 201, 202, 203, 204

2 and 205, 206, 207, 208

3 and 209, 210, 211, 212

4 and 213, 214, 215, 216

2 7 and 203, 204, 205, 206

5/6 and 207, 208, 209, 210

8 and 211, 212, 213, 214

9 and 215, 216, 201, 202

Figure 9: Booklet Allocation to Markers for Multiple Marking ofReading

In this scenario, with all 16 markers working,Booklets 1 to 4 were to be marked at the sametime in the first step. The 48 Booklet 1’s, forexample, were to be divided into four bundles of12, and rotated among markers 201, 202, 203and 204, so that each marker eventually wouldhave marked all 48 of this booklet. The samepattern was to be followed for Booklets 2, 3 and 4.Each booklet had approximately the samenumber of items requiring marking and the samenumber of booklets (48) to be marked.

After Booklets 1 to 4 had been put through themultiple-marking process, the pairs of markers wereto regroup (but remain in their pairs) and followthe allocation in the second half of Figure 9.That is, markers 203, 204, 205 and 206 were tomark Booklet 7, markers 207, 208, 209 and 210were to mark Booklets 5 and 6, and so on forthe remaining booklets. Booklets 5 and 6contained fewer items requiring marking thanBooklet 7, and there were more Booklets 8 and 9to process. The closest to a balanced workloadwas achieved by regarding Booklets 5 and 6 asequivalent to one booklet for this exercise.

If only eight markers were available, thedesign could be applied by using the groupdesignations in Figure 9 to indicate markeridentification numbers. However, four steps, nottwo, were needed to achieve four markings perbooklet. The third and fourth steps could beachieved by starting the third step with Booklet 1marked by markers 3 and 4 and continuing thepattern in a similar way as in the figure, and bystarting the fourth step with Booklet 7 markedby markers 4 and 5.

Allocating booklets to markers for multiplemarking was quite complex and the markingsupervisor had to monitor the flow of bookletsthroughout the process.

The multiple-marking design for mathematicsand science shown in Figure 10 assumed eightmarkers, with identification numbers 401 to 408,who each marked both mathematics and science.This design also provided for balanced linksbetween clusters and markers. The designrequired four stages, which could be carried outin a different order from that shown if this helpedwith booklet flow, but the allocation of bookletsto markers had to be retained. Since there weremore items to process in stage 4, it was importantfor all markers to undertake this stage together.

Fie

ld o

per

atio

ns

83

If different markers were used for mathem-atics and science, a different multiple-markingdesign was necessary. Assuming four markers foreach of mathematics and science, then each booklethad to be marked by each marker. According tothe scheme for assigning marker identificationnumbers, the markers would have had identificationnumbers: 101, 102, 103 and 104 for mathem-atics; and 301, 302, 303 and 304 for science.

Five booklets contained mathematics items(Booklets 1, 3, 5, 8 and 9), and five containedscience items (Booklets 2, 4, 6, 8 and 9). Becausethere were different numbers of booklets andvery large differences in numbers of itemsrequiring multiple marking within them, it wasimportant for all markers to work on the samebooklet at the same time.

The design in Figure 11 was suggested, with ananalogous one for science. The rotation numbersare marker identification numbers. If it helpedthe overall workflow, booklets could be markedin a different order from that shown in the figure.

Stage Booklet Domain Marker Identifications

1 1 Mathematics 401, 402, 403, 404

2 Science 405, 406, 407, 408

2 3 Mathematics 402, 403, 404, 405

4 Science 406, 407, 408, 401

3 5 Mathematics 403, 404, 405, 406

6 Science 407, 408, 401, 402

4 8 Maths and Science 404, 405, 406, 407

9 Maths and Science 408, 401, 402, 403

Figure 10: Booklet Allocation for Multiple Marking ofMathematics and Science, Common Markers

Booklet First Second Third FourthRotation Rotation Rotation Rotation

1 101 102 103 1043 102 103 104 1015 103 104 101 1028 104 101 102 1039 101 102 103 104

Figure 11: Booklet Allocation for Multiple Marking of Mathematics

Cross-national Marking

Cross-national comparability in assigning markswas explored through an inter-country raterreliability study (see Chapter 10 and Chapter 14).

Questionnaire Coding

The main coding required for the StudentQuestionnaire internationally was the mother’sand father’s occupation and student’soccupational expectation. Four-digitInternational Standard Classification ofOccupations (ISCO88) codes (InternationalLabour Organisation, 1988) were assigned tothese three variables. In several countries, thiscould be done in many ways. NPMs could use anational coding scheme with more than 100occupational title categories, provided that thisnational classification could be recoded to ISCO.A national classification was preferred becauserelationships between occupational status andachievement could then be compared within acountry using both international and nationalmeasures of occupational status.

The PISA website gave a short, clear summaryof ISCO codes and occupational titles forcountries to translate if they had neither anational occupational classification scheme noraccess to a full translation of ISCO.

In their national options, countries may alsohave needed to pre-code responses to some itemsbefore data from the questionnaire were enteredinto the software.

Data Entry, Data Checkingand File Submission

Data Entry

The Consortium provided participating countrieswith data entry software (KeyQuest®) that ranunder Windows 95® or later, and Windows NT4.0® or later. KeyQuest® contained thedatabase structures for all the booklets andquestionnaires used in the main survey. Variablescould be added or deleted as needed for nationaloptions. Data were to be entered directly fromthe test booklets and questionnaires, except forthe multiple-marking study, where the marksfrom the first three markers had been written onseparate sheets.

KeyQuest® performed validation checks as

data were entered. Importing facilities were alsoavailable if data had already been entered intotext files, but it was strongly recommended thatdata be entered directly into KeyQuest® to takeadvantage of its many PISA-specific features.

A separate Data Entry Manual provided fulldetails of the functionality of the KeyQuest®software and complete instructions on dataentry, data management and how to carry outvalidity checks.

Data Checking

NPMs were responsible for ensuring that manychecks of the quality of their country’s data weremade before the data files were submitted to theConsortium. The checking procedures requiredthat the List of Sampled Schools and the StudentTracking Form for each school were alreadyaccurately completed and entered intoKeyQuest®. Any errors had to be correctedbefore the data were submitted. Copies of thecleaning reports were to be submitted togetherwith the data files. More details on the cleaningsteps are provided in Chapter 11.

Data Submission

Files to be submitted included:• data for the test booklets and context

questionnaires;• data for the international option instrument(s)

if used;• data for the multiple-marking study;• List of Sampled Schools; and• Student Tracking Forms.

Hard or electronic copies of the last two itemswere also required.

After Data Were Submitted

NPMs were required to designate a datamanager who would work actively with theConsortium’s data processing centre at ACERduring the international data cleaning process.Responses to requests for information by theprocessing centre were required within threeworking days of the request.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

84

Quality Monitoring

Adrian Harvey-Beavis and Nancy Caldwell

It is essential that a high profile and expensiveproject such as PISA be undertaken with highstandards. This requires not only that proceduresbe carefully developed, but that they be carefullymonitored to ensure that they are in fact fullyand completely implemented. Should it happenthat they were not implemented fully, it is neces-sary to understand to what extent they were not,and the likely implications of this for dataquality. This is the task of quality monitoring.

Quality Monitoring in PISA is, therefore,about observing the extent to which data arecollected, retrieved, and stored according to theprocedures described by the field operationsmanuals. Quality control is embedded in thefield operations procedures. For PISA 2000,Quality Monitors were appointed to do thisobserving, but the responsibility for qualitycontrol resided with the National ProjectManagers (NPMs) who were to implement thefield operation guidelines and thus establishquality control.

A program of national centre and school visitswas central to ensuring full, valid implementationof PISA procedures. The main aims of the site visitprogram were to forestall operational problemsand to ensure that the data collected in differentcountries were comparable and of the highestquality.

There were two levels of quality monitoring inthe PISA project: • National Centre Quality Monitors (NCQMs)—

To observe how PISA field operations were beingimplemented at the national level, Consortiumrepresentatives monitored the national centresby visiting NPMs in each country just prior to

the field trial and just prior to the main study.• School Quality Monitors (SQMs)—Employed by

the Consortium, but located in participatingcountries, SQMs visited a sample of schools torecord how well the field operations guidelineswere followed. They visited a small number ofschools for the field trial (typically around fivein each country) and around 30 to 40 schoolsfor the main study.

Preparation of QualityMonitoring Procedures

National Centre Quality Monitors (NCQMs)were members of the Consortium. They metprior to the field trial to develop the instrumentsneeded to collect data for reporting purposes. Astandardised interview schedule was developedfor use in discussions with the NPM. Thisinstrument addressed key aspects of proceduresdescribed in the NPM Manual and other topicsrelevant to considering data quality, for example,relations between participating schools and thenational centre.

A School Quality Monitor Manual was preparedfor the field trial and later revised for the mainstudy. This manual outlined the duties of SQMs,including con-fidentiality requirements. Itcontained the Data Collection Sheet to be usedby the SQM when visiting a school. Aninterview schedule to be used with the SchoolCo-ordinator was also included. Additionally,for the field trial, a short interview to get studentfeedback on questionnaires was included.

Chapter 7

Implementing QualityMonitoring Procedures

Training National Centre QualityMonitors and School QualityMonitors

The Consortium organised training sessions forthe National Centre Quality Monitors (NCQMs),who trained the School Quality Monitors (SQMs).

As part of their training, NCQMs received anoverview of the design and purpose of theproject, which gave special attention to theresponsibilities of NPMs in conducting the studyin their country.

The Manual for National Centre QualityMonitors was used for the training session, andNCQMs were trained to use a schedule forinterviewing NPMs.

School Quality Monitors were trained toconduct on-site quality monitoring in schoolsand to prepare a report on the school visits. TheSchool Quality Monitor Manual was used fortheir training sessions.

National Project Managers were asked to:• nominate individuals who could fulfil the role

of SQM, who were then approved by theConsortium. Where nominations were notapproved, new nominations were solicited;and

• provide the SQMs with the schedule fortesting dates and times and other details suchas name of school, location, contact details,name of School Co-ordinator and TestAdministrator (if the two were different).

National Centre Quality Monitoring

For both the field trial and the main study,NCQMs visited all national centres, includingnational centres where a Consortium memberwas based.1

The visits to the NPMs took place about amonth before testing started. This meant thatprocedures could be reviewed when possible tomake minor adjustments. It also meant that the

PIS

A 2

00

0 t

ech

nic

al r

epor

t

86

NCQMs could train SQMs not too long beforethe testing. National Centre Quality Monitorsasked NPMs about (i) implementing proceduresat the national centre and (ii) any difficultiessuggesting that changes may be needed to thefield operations.

Questions on procedures were focused on:• how sampling procedures were implemented;• communication between national centres and

participating schools;• translating, printing and distribution of

materials to schools; • management procedures for materials at the

national centre;• planning for data preparation, data checking

and analysis;• data collection quality control procedures,

particularly documentation of data collectionand data entry; and

• data management procedures.While visiting NPMs, the NCQMs collected

copies of manuals and tracking forms used ineach country, that were subsequently reviewedand evaluated.

School Quality Monitoring

National Project Managers (NPMs) were askedto nominate SQMs and advised to consider thatpersons nominated:• should be knowledgeable, or might reasonably

be expected to easily learn about, theprocedures and materials of PISA;

• must speak fluently the language of thecountry and either English or French;

• should have an education or assessmentbackground;

• may be Consortium staff or individualsspecifically selected to undertake schoolmonitoring;

• need to be approved by the Consortium andwork independently of NPMs;

• need to be available to attend a trainingsession; and

• may undertake monitoring in their owncountry or exchange with another countrythrough a reciprocal agreement.For the main study, NPMs were also advised

that it was preferable that an SQM trained forthe field trial should also undertake school visitsduring the main study in 2000.

The Consortium paid SQMs’ expenses and fees.

1 To ensure independence of the Quality Monitorsfrom the centres they visited it was decided, for thefield trial, that nationals should not visit their owncentre. For the main study, this was relaxed, as therewas no evidence in the field trial of problems amongthese national centres.

For school visits, the field trial provided theopportunity to observe and judge:• the quality of the Test Administrator’s

Manual;• the clarity of instructions used by the Test

Administrators; and• whether any of the procedures or

documentation used in the main study shouldbe modified.For the main study, school visits focused

explicitly on observing the implementation ofprocedures, and the impact of the testingenvironment on data. During their visits, SQMs:• reviewed the activities preliminary to the test

administration, such as distribution ofbooklets;

• observed test administration;• verified the school Test Administrator’s record

keeping for example, whether assessedstudents match sampled students;

• asked for comments on procedures andmaterials; and

• completed a report on each school visited,which was returned to the Consortium.The original plan was to have SQMs visit

schools selected at random, but this procedureproved too costly to implement. In all countries,at least one school was randomly selected, andthe remainder were selected with considerationgiven to proximity to the residence of the SQMto contain travel and accommodation costs. Inmany countries, the school selection procedurewas to divide the country into regions with oneSQM responsible for each region. One schoolwas randomly selected in each region, and theremainder selected also with consideration givento proximity to the SQM’s residence. In practice,SQMs often still had to travel large distancesbecause many schools had testing on the same day.

The majority of school visits wereunannounced. Three countries would not permitthis because national policy and othercircumstances (especially, for example, remoteschools) required making prior arrangementswith the schools so they could assist withtransportation and accommodation.

Site Visit Data

Reports on the results of the national centre andschool site visits were prepared by the Consortiumand distributed to national centres after both the

field trial and the main study.For the main study, the national Quality

Monitoring reports were used as part of the dataadjudication process (see Chapter 4). Anaggregated report on quality monitoring is alsoincluded as Appendix 5.

Qu

alit

y m

onit

orin

g

87

Survey Weighting and theCalculation of SamplingVariance

Keith Rust and Sheila Krawchuk

Survey weights were required to analyse PISAdata, to calculate appropriate estimates ofsampling error, and to make valid estimates andinferences. The Consortium calculated surveyweights for all assessed and excluded students,and provided variables in the data that permitusers to make approximately unbiased estimatesof standard errors, to conduct significance testsand to create confidence intervals appropriately,given the sample design for PISA in eachindividual country.

Survey Weighting

Students included in the final PISA sample for agiven country are not all equally representativeof the entire student population, despite randomsampling of schools and students for selectingthe sample. Survey weights must therefore beincorporated into the analysis.

There are several reasons why the surveyweights are not the same for all students in agiven country:• A school sample design may intentionally over

or under-sample certain sectors of the schoolpopulation: in the former case, so that theycould be effectively analysed separately fornational purposes, such as a relatively smallbut politically important province or region,or a sub-population using a particularlanguage of instruction; and in the latter case,for reasons of cost, or other practicalconsiderations,1 such as very small orgeographically remote schools.

section three: data processing

• Information about school size available at thetime of sampling may not have beencompletely accurate. If a school was expectedto be very large, the selection probability wasbased on the assumption that only a sampleof its students would be selected for PISA. Butif the school turned out to be quite small, allstudents would have to be included andwould have, overall, a higher probability ofselection in the sample than planned, makingthese inclusion probabilities higher than thoseof most other students in the sample.Conversely, if a school thought to be smallturned out to be large, the students includedin the sample would have had smallerselection probabilities than others.

• School non-response, where no replacementschool participated, may have occurred,leading to the under-representation of studentsfrom that kind of school, unless weightingadjustments were made. It is also possible thatonly part of the eligible population in a school(such as those 15-year-olds in a single grade)were represented by its student sample, whichalso requires weighting to compensate for themissing data from the omitted grades.

• Student non-response, within participatingschools, occurred to varying extents. Studentsof the kind that could not be givenachievement test scores (but were notexcluded for linguistic or disability reasons)

1 Note that this is not the same as excluding certainportions of the school population. This also happenedin some cases, but cannot be addressed adequatelythrough the use of survey weights.

Chapter 8

will be under-represented in the data unlessweighting adjustments are made.

• Trimming weights to prevent undue influenceof a relatively small subset of the school orstudent sample might have been necessary if asmall group of students would otherwise havemuch larger weights than the remainingstudents in the country. This can lead tounstable estimates—large sampling errors—but cannot be well estimated. Trimmingweights introduces a small bias into estimatesbut greatly reduces standard errors.

• Weights need adjustment for analysingmathematics and science data to reflect the factthat not all students were assessed in each subject.The procedures used to derive the survey

weights for PISA reflect the standards of bestpractice for analysing complex survey data, andthe procedures used by the world’s major statisticalagencies. The same procedures were used in otherinternational studies of educational achievement:the Third International Mathematics and ScienceStudy (TIMSS), the Third InternationalMathematics and Science Study-Repeat (TIMSS-R), the Civic Education Study (CIVED), theProgress in International Reading Literacy Study2001 (PIRLS), which were all implemented bythe International Association for the Evaluationof Educational Achievement (IEA); and theInternational Assessment of Educational Progress(IAEP, 1991). (See Cochran, 1977 and Särndal,Swensson and Wretman, 1992 for the underlyingstatistical theory on survey sampling texts.)

The weight, , for student j in school iconsists of two base weights—the school and thewithin-school—and five adjustment factors, andcan be expressed as:

where: , the school base weight, is given as the

reciprocal of the probability of inclusion ofschool i into the sample;

, the within-school base weight, is given as thereciprocal of the probability of selection ofstudent j from within the selected school i;

is an adjustment factor to compensate fornon-participation by other schools that aresomewhat similar in nature to school i (notalready compensated for by the participationof replacement schools);

is an adjustment factor to compensate for thefact that, in some countries, in some schoolsonly 15-year-old students who were enrolledin the modal grade for 15-year-olds wereincluded in the assessment;

is an adjustment factor to compensate for theabsence of achievement scale scores fromsome sampled students within school i (whowere not excluded);

t1i is a school trimming factor, used to reduceunexpectedly large values of ; and

t2ij is a student trimming factor, used to reducethe weights of students with exceptionallylarge values for the product of all thepreceding weight components.

The School Base Weight

The term is referred to as the school baseweight. For the systematic probability-proportional-to-size school sampling methodused in PISA, this is given as:

.(1)

The term denotes the measure of sizegiven to each school on the sampling frame.Despite country variations, was usuallyequal to the (estimated) number of 15-year-oldsin the school, if it was greater than the predeter-mined target cluster size (35 in most countries).If the enrolment of 15-year-olds was less than theTarget Cluster Size (TCS), then . Inaddition in countries where a stratum of very smallschools was used, if the number of 15-year-oldsin a school was below , then .

The term denotes the samplinginterval used within the explicit samplingstratum g that contains school I and is calculatedas the total of values for all schools instratum g, divided by the school sample size forthat stratum.

Thus, if school i was estimated to have 10015-year-olds at the time of sample selection,

. If the country had a single explicitstratum (g=1) and the total of the valuesover all schools was 150 000, with a school

mos i( )mos i( ) =100

mos i( )

int g / i( )mos i TCS( ) = 2TCS 2

mos i TCS( ) =

mos i( )

mos i( )

wg

mos i mos i gi1

1=

( ) <

int / i int / i( ) ( ) ( )if

otherwise

w i1

w i1

f i2

f ijA

1

f i1

w ij2

w1i

W t f f f t w wij ij i i ijA

i ij i= 2 1 2 1 1 2 1

Wij

PIS

A 2

00

0 t

ech

nic

al r

epor

t

90

sample size of 150, then int(1/i) = 150 00/150=1 000, for school i (and others in the sample),giving wli = 1 000/100 = 10.0. Roughly speaking,the school can be thought of as representingabout 10 schools from the population. In thisexample, any school with 1 000 or more 15-year-old students would be included in thesample with certainty, with a base weight of wli = 1.

The School Weight Trimming Factor

Once school base weights were established for eachsampled school in the country, verifications weremade separately within each explicit samplingstratum to see if the school weights required trim-ming. The school trimming factor t1i, is the ratioof the trimmed to the untrimmed school baseweight, and is equal to 1.0000 for most schoolsand therefore most students, and never exceedsthis value. (See Table 23 for the number of schoolrecords in each country that received some kindof base weight trimming.)

The first check was for schools that had been as-signed very small selection probabilities because ofvery small enrolments, although the school samplingprocedure described in the Sampling Manual andin Chapter 4 of this report was designed to preventthis. Cases did arise where schools had very smallprobabilities of selection relative to others in thesame stratum because they were assigned a valueof that was much smaller than ,usually because the actual enrolment was used asthe value of , even when it was much smallerthan . Where the value for TCS was 35,as in most jurisdictions, if a school with one15-year-old enrolled student was given of 1 rather than 17.5 (or 35, depending on whetheror not a small school stratum was used), the schoolbase weight was very large. The sampled students inthat school would have received a weight 35 timesgreater than that of a student in the sample froma school with 35 15-year-old students.

These schools were given a compromiseweight neither excessively large nor introducinga bias by under-representing students from verysmall schools (that often constituted a sizeablefraction of the PISA student population). This wasdone by essentially replacing the inappropriate

with TCS/6, giving them and their studentsweights six times as great as that of most studentsin the country and three times as great as theywould have had if the small school sampling

option been implemented correctly. Permittingthese schools to have weights three times as greatas they would have had, had they been sampledcorrectly, was also consistent with the othertrimming procedures, described below.

The second school-level trimming adjustmentwas applied to schools that turned out to be muchlarger than was believed at the time of sampling—where 15-year-old enrolment exceeded

. For example, if TCS = 35,then a school flagged for trimming had morethan 105 PISA-eligible students, and more thanthree times as many students as was indicated onthe school sampling frame. Because the studentsample size was set at TCS regardless of the actualenrolment, the student sampling rate was muchlower than anticipated during the school sampling.This meant that the weights for the sampledstudents in these schools would have been morethan three times greater than anticipated whenthe school sample was selected. These schoolshad their school base weights trimmed by having

replaced by in theschool base weight formula.

The Student Base Weight

The term is referred to as the student baseweight. With the PISA procedure for samplingstudents, did not vary across students (j)within a particular school i. Thus is given as:

, (2)

where is the actual enrolment of 15-year-olds in the school (and so, in general, issomewhat different from the estimated ),and is the sample size within school i. Itfollows that if all students from the school wereselected, then for all eligible students inthe school. For all other cases .

School Non-response Adjustment

In order to adjust for the fact that those schoolsthat declined to participate, and were notreplaced by a replacement school, were not ingeneral typical of the schools in the sample as awhole, school-level non-response adjustmentswere made. Several groups of somewhat similarschools were formed within a country, and within

w ij2 1>w ij2 1=

sam i( )mos i( )

enr i( )

w enr isam iij2 = ( )

( )

w ij2

w ij2

w ij2

3× ( )max , ( )TCS mos imos i( )

3× ( )max , ( )TCS mos i

mos i( )

mos i( )

TCS 2mos i( )

TCS 2mos i( )

Surv

ey w

eigh

tin

g an

d t

he

calc

ula

tion

of

sam

pli

ng

vari

ance

91

each group the weights of the responding schoolswere adjusted to compensate for the missingschools and their students. The compositions ofthe non-response groups varied from country tocountry, but were based on cross-classifying theexplicit and implicit stratification variables usedat the time of school sample selection. Usually,about 10 to 15 such groups were formed withina given country depending upon schooldistribution with respect to stratificationvariables. If a country provided no implicitstratification variables, schools were divided intothree roughly equal groups, within each stratum,based on their size (small, medium or large). Itwas desirable to ensure that each group had atleast six participating schools, as small groupscan lead to unstable weight adjustments, whichin turn would inflate the sampling variances.However, it was not necessary to collapse cellswhere all schools participated, as the school non-response adjustment factor was 1.0 regardless ofwhether cells were collapsed or not. Adjustmentsgreater than 2.0 were flagged for review, as theycan cause increased variability in the weights,and lead to an increase in sampling variances. Ineither of these situations, cells were generallycollapsed over the last implicit stratificationvariable(s) until the violations no longer existed.In countries with very high overall levels ofschool non-response after school replacement, the

requirement for school non-response adjustmentfactors all to be below 2.0 was waived.

Within the school non-response adjustmentgroup containing school i, the non-responseadjustment factor was calculated as:

, (3)

where the sum in the denominator is over Γ(i),the schools within the group (originals and replace-ments) that participated, while the sum in thenumerator is over Ω(i), those same schools, plusthe original sample schools that refused and werenot replaced. The numerator estimates thepopulation of 15-year-olds in the group, whilethe denominator gives the size of the populationof 15-year-olds directly represented by participatingschools. The school non-response adjustment factorensures that participating schools are weighted torepresent all students in the group. If a school didnot participate because it had no eligible studentsenrolled, no adjustment was necessary since thiswas neither non-response nor under-coverage.

Table 22 shows the number of school non-response classes that were formed for each country,and the variables that were used to create the cells.

f

w enr k

w enr ki

kk i

kk i

1

1

1

=

( )

( )∈ ( )

∈ ( )

∑∑

Ω

Γ

PIS

A 2

00

0 t

ech

nic

al r

epor

t

92

Table 22: Non-Response Classes

Country Variables Used to Create Non-Response Classes Original FinalNumber Numberof Cells of Cells After

Collapsing Small Cells

Australia Metropolitan/Country 42 24

Austria No school non-response adjustments

Belgium (Fl.) For strata 1-3, School Type and Organisation 27 8

Belgium (Fr.) 5 categories of the school proportion of overage students 14 4

Brazil Public/Private, Region, Score Range 104 50

Canada Public/Private, Urban/Rural 134 64

Czech Republic 3 School Sizes 65 65

Denmark First 2 digits of concatenated School Type and County 78 24

Surv

ey w

eigh

tin

g an

d t

he

calc

ula

tion

of

sam

pli

ng

vari

ance

93

Table 22 (cont.)

Country Variables Used to Create Non-Response Classes Original FinalNumber Numberof Cells of Cells After

Collapsing Small Cells

England For very small schools: School Type, Region; 50 17For other schools: School Type, Exam Results (LEA or Grant Maintained schools) or Co-educational Status (Independent Schools), Region

Finland No school non-response adjustments

France 3 School Sizes 18 10

Germany For strata 66 and 67, State was used. For all others, 60 29the explicit stratum was used.

Greece School Type 30 14

Hungary Combinations of School Type, Region, and Location 100 31(75 values)

Iceland 3 School Sizes 21 14

Ireland School Type, Gender Composition 16 9

Italy No school non-response adjustments

Japan 3 School Sizes 12 10

Korea No school non-response adjustments

Latvia Urbanisation, School Type 29 10

Liechtenstein No school non-response adjustments

Luxembourg 3 School Sizes 6 4

Mexico No school non-response adjustments

Netherlands 3 School Sizes 15 7

New Zealand Public/Private, School Gender composition, 15 9Socio-Economic Status, Urban/Rural

Northern Ireland School Type, Examination Results, Region 30 14

Norway 3 School Sizes 15 9

Poland 3 School Sizes 9 9

Portugal First 2 digits of Public/Private, 3 digits of Region and 22 11the Index of Social Development for 3 regions.

Russian Federation School Program, Urbanisation 47 46

Scotland For strata 1, Average School Achievement and the 49 11first 2 digits of the National ID; for strata 2, the first 2 digits of the National ID.

Spain No school non-response adjustments

Sweden Combinations of the implicit stratification variables 15 11

Switzerland Pseudo Sampling Strata and 3 School Sizes 49 37

United States Public/Private, High/Low Minority and 101 18Private School Type, PSU

PIS

A 2

00

0 t

ech

nic

al r

epor

t

94

Grade Non-response Adjustment

In a few countries, several schools agreed to participate in PISA but required that participation berestricted to 15-year-olds in the modal grade for 15-year-olds, rather than all 15-year-olds, because ofperceived administrative inconvenience. Since the modal grade generally included the majority of thepopulation to be covered, some of these schools were accepted as participants. For the part of the15-year-old population in the modal grade, these schools were respondents, while for the rest of thegrades in the school with 15-year-olds, this school was a refusal. This situation occasionally arose for agrade other than the modal grade because of other reasons, such as other testing being carried out forcertain grades at the same time as the PISA assessment. To account for this, a special non-responseadjustment was calculated at the school level for students not in the modal grade (and was automatically1.0 for all students in the modal grade).

Within the same non-response adjustment groups used for creating school non-response adjustmentfactors, the grade non-response adjustment factor for all students in school i, , is given as:

(4)

The variable is the approximate number of 15-year-old students in school k but not in themodal grade. The set B(i) is all schools that participated for all eligible grades (from within the non-response adjustment group with school (i)), while the set C(i) includes these schools and those that onlyparticipated for the modal responding grade.

This procedure gave a single grade non-response adjustment factor for each school, which dependedupon its non-response adjustment class. Each individual student received this factor value if they did notbelong to the modal grade, and 1.0000 if they belonged to the modal grade. In general, this factor is notthe same for all students within the same school.

enra k( )

f1ijA =

w1kenra k( )k ∈ C i( )∑

w1kenra k( ).k∈Β i( )∑ for students not in the modal grade

1 otherwise

f ijA

1

Student Non-response Adjustment

Within each participating school, the studentnon-response adjustment was calculated as:

, (5)

where the set is all assessed students in theschool and the set is all assessed students inthe school plus all others who should have beenassessed (i.e., who were absent, not excluded orineligible).

In most cases, this student non-responsefactor reduces to the ratio of the number ofstudents who should have been assessed to thenumber who were assessed. In some cases ofsmall schools (fewer than 15 respondents), itwas necessary to collapse schools together, andthen the more complex formula above applied.

Additionally, an adjustment factor greater than2.0 was not allowed for the same reasons notedunder school non-response adjustments. If thisoccurred, the school with the large adjustmentwas collapsed with the next school in the sameschool non-response cell.

Some schools in some countries had very lowstudent response levels. In these cases it wasdetermined that the small sample of assessedstudents was potentially too biased as arepresentation of the school to be included in thePISA data. For any school where the studentresponse rate was below 25 per cent, the schoolwas therefore treated as a non-respondent, andits student data were removed. In schools withbetween 25 and 50 per cent student response, thestudent non-response adjustment described abovewould have resulted in an adjustment factor ofbetween 2.0000 and 4.0000, and so these schoolswere collapsed with others to create student non-response adjustments.

X i( )∆ i( )

f 2i =

f1iw1iw2ikk∈Χ i( )∑

f1iw1iw2ikk ∈∆ i( )∑

f i2

Surv

ey w

eigh

tin

g an

d t

he

calc

ula

tion

of

sam

pli

ng

vari

ance

95

(Chapter 12 describes these schools as beingtreated as non-respondents for the purpose ofresponse rate calculation, even though theirstudent data were used in the analyses.)

Trimming Student Weights

This final trimming check was used to detectstudent records that were unusually largecompared to those of other students within thesame explicit stratum. The sample design wasintended to give all students from within thesame explicit stratum an equal probability ofselection and therefore equal weight, in theabsence of school and student non-response. Asalready noted, inappropriate school samplingprocedures and poor prior information about thenumber of eligible students in each school couldlead to substantial violations of this principle.Moreover, school, grade, and student non-response adjustments, and, occasionally,inappropriate student sampling could, in a fewcases, accumulate to give a few students in thedata relatively very large weights, which addsconsiderably to sampling variance. The weightsof individual students were therefore reviewed,and where the weight was more than four timesthe median weight of students from the sameexplicit sampling stratum, it was trimmed to beequal to four times the median weight for thatexplicit stratum.

The student trimming factor t2ij is equal to theratio of the final student weight to the studentweight adjusted for student non-response, andtherefore equal to 1.0000 for the great majorityof students. The final weight variable on thedata file was called w_fstuwt, which is the finalstudent weight that incorporates any student-level trimming. Table 23 shows the number ofstudents with weights trimmed at this point inthe process (i.e., ) for each countryand the number of schools for which the schoolbase weight was trimmed (i.e., ).t i1 1 0000< .

t ij2 1 0000< .

Table 23: School and Student Trimming

Country Schools StudentsTrimmed Trimmed

Australia 1 0

Austria 45 0

Belgium (Fl.) 0 0

Belgium (Fr.) 0 0

Brazil 2 0

Canada 2 0

Czech Republic 1 0

Denmark 1 0

England 0 0

Finland 0 0

France 0 0

Germany 0 0

Greece 1 0

Hungary 2 0

Iceland 0 0

Ireland 0 0

Italy 1 0

Japan 0 37

Korea 1 4

Latvia 1 0

Liechtenstein 0 0

Luxembourg 0 0

Mexico 1 23

Netherlands 0 0

New Zealand 0 0

Northern Ireland 0 0

Norway 0 0

Poland 1 0

Portugal 0 0

Russian Federation 3 39

Scotland 0 0

Spain 5 11

Sweden 0 0

Switzerland 2 0

United States 0 0

Subject-specific Factors forMathematics and Science

The weights described above are appropriate foranalysing data collected from all assessed students—i.e., questionnaire and reading achievement data.Because a special booklet (SE or Booklet 0) wasused in some countries for certain kinds ofstudents and not at random, additional weightingfactors are required to analyse data obtainedfrom only a subset of the 10 PISA test booklets,particularly for analysing mathematics andscience scale scores. These additional weightingfactors were calculated and included with thedata.

The mathematics weight factor was given as:1.0 for each student assigned Booklet 0; 1.8 foreach student assigned Booklet 1, 3, 5, 8, or 9;and 0.0 for each student assigned Booklet 2, 4,6, or 7, which contained no mathematics items.

The science weight factor was given as: 1.0for each student assigned Booklet 0; 1.8 for eachstudent assigned Booklet 2, 4, 6, 8, or 9; and 0.0for each student assigned Booklet 1, 3, 5, or 7,which contained no science items.

Calculating SamplingVariance

To estimate the sampling variances of PISAestimates, a replication methodology wasemployed. This reflected the variance inestimates due to the sampling of schools andstudents. Additional variance due to the use ofplausible values from the posterior distributionsof scaled scores was captured separately,although computationally the two componentscan be carried out in a single program, such asWesVar 4 (Westat, 2000).

The Balanced Repeated ReplicationVariance Estimator

The approach used for calculating samplingvariances for PISA is known as BalancedRepeated Replication (BRR), or Balanced Half-Samples; the particular variant known as Fay’smethod was used. This method is very similar innature to the jackknife method used in previousinternational studies of educational achievement,such as TIMSS. It is well documented in thesurvey sampling literature (see Rust, 1985; Rustand Rao, 1996; Shao, 1996; Wolter, 1985). The

major advantage of BRR over the jackknife isthat the jackknife method is not fullyappropriate for use with non-differentiablefunctions of the survey data, most noticeablyquantiles. It provides unbiased estimates, but notconsistent ones. This means that, dependingupon the sample design, the variance estimatorcan be very unstable, and despite empiricalevidence that it can behave well in a PISA-likedesign, theory is lacking. In contrast BRR doesnot have this theoretical flaw. The standard BRRprocedure can become unstable when used toanalyse sparse population subgroups, but Fay’smodification overcomes this difficulty, and iswell justified in the literature (Judkins, 1990).

The BRR approach was implemented asfollows, for a country where the student samplewas selected from a sample of rather than allschools:• Schools were paired on the basis of the

explicit and implicit stratification and frameordering used in sampling. The pairs wereoriginally sampled schools, or pairs thatincluded a participating replacement if anoriginal refused. For an odd number ofschools within a stratum, a triple was formedconsisting of the last school and the pairpreceding it.

• Pairs were numbered sequentially, 1 to H,with pair number denoted by the subscript h.Other studies and the literature refer to suchpairs as variance strata or zones, or pseudo-strata.

• Within each variance stratum, one school (thePrimary Sampling Unit, PSU) was randomlynumbered as 1, the other as 2 (and thethird as 3, in a triple), which defined thevariance unit of the school. Subscript j refersto this numbering.

• These variance strata and variance units (1, 2,3) assigned at school level are attached to thedata for the sampled students within thecorresponding school.

• Let the estimate of a given statistic from thefull student sample be denoted as . This iscalculated using the full sample weights.

• A set of 80 replicate estimates, (where truns from 1 to 80), was created. Each of thesereplicate estimates was formed by multiplyingthe sampling weights from one of the twoprimary sampling units (PSUs) in each stratumby 1.5, and the weights from the remaining

Xt*

X*

PIS

A 2

00

0 t

ech

nic

al r

epor

t

96

PSUs by 0.5. The determination as to whichPSUs received inflated weights, and whichreceived deflated weights, was carried out in asystematic fashion, based on the entries in aHadamard matrix of order 80. A Hadamardmatrix contains entries that are +1 and –1 invalue, and has the property that the matrix,multiplied by its transpose, gives the identitymatrix of order 80, multiplied by a factor of 80.Examples of Hadamard matrices are given inWolter (1985).

• In cases where there were three units in atriple, either one of the schools (designated atrandom) received a factor of 1.7071 for agiven replicate, with the other two schoolsreceiving factors of 0.6464, or else the oneschool received a factor of 0.2929 and theother two schools received factors of 1.3536.The explanation of how these particularfactors came to be used is explained inAppendix 12.

• To use a Hadamard matrix of order 80 requiresthat there be no more than 80 variance stratawithin a country, or else that some combiningof variance strata be carried out prior toassigning the replication factors via theHadamard matrix. The combining of variancestrata does not cause any bias in varianceestimation, provided that it is carried out insuch a way that the assignment of varianceunits is independent from one stratum toanother within strata that are combined. Thatis, the assignment of variance units must becompleted before the combining of variancestrata takes place, and this approach was usedfor PISA.

• The reliability of variance estimates forimportant population subgroups is enhancedif any combining of variance strata that isrequired is conducted by combining variancestrata from different subgroups. Thus in PISA,variance strata that were combined wereselected from different explicit sampling strataand, to the extent possible, from differentimplicit sampling strata also.

• In some countries, it was not the case that theentire sample was a two-stage design, of firstsampling schools and then sampling students.In some countries for part of the sample (andfor the entire samples for Iceland,Liechtenstein and Luxembourg), schools wereincluded with certainty into the sampling, so

that only a single stage of student samplingwas carried out for this part of the sample. Inthese cases instead of pairing schools, pairs ofindividual students were formed from withinthe same school (and if the school had an oddnumber of sampled students, a triple ofstudents was formed also). The procedure ofassigning variance units and replicate weightfactors was then conducted at the studentlevel, rather than at the school level.

• In contrast, in a few countries there was astage of sampling that preceded the selectionof schools, for at least part of the sample.This was done in a major way in the RussianFederation and the United States, and in amore minor way in Germany and Poland. Inthese cases there was a stage of sampling thattook place before the schools were selected.Then the procedure for assigning variancestrata, variance units and replicate factors wasapplied at this higher level of sampling. Theschools and students then inherited theassignment from the higher-level unit in whichthey were located.

• The variance estimator is then:

. (6)

The properties of BRR have been establishedby demonstrating that it is unbiased and consistentfor simple linear estimators (i.e., means fromstraightforward sample designs), and that it hasdesirable asymptotic consistency for a widevariety of estimators under complex designs, andthrough empirical simulation studies.

Reflecting Weighting Adjustments

This description glosses over one aspect of theimplementation of the BRR method. Weights fora given replicate are obtained by applying theadjustment to the weight components that reflectselection probabilities (the school base weight inmost cases), and then re-computing the non-response adjustment replicate by replicate.

Implementing this approach required that theConsortium produce a set of replicate weights inaddition to the full sample weight. Eighty suchreplicate weights were needed for each student inthe data file. The school and student non-response adjustments had to be repeated foreach set of replicate weights.

V X X XBRR tt

∗ ∗ ∗

=

( ) = −( ) ∑0 05 2

1

80

.

Surv

ey w

eigh

tin

g an

d t

he

calc

ula

tion

of

sam

pli

ng

vari

ance

97

To estimate sampling errors correctly, theanalyst must use the variance estimation formulaabove, by deriving estimates using the t-thset of replicate weights instead of the full sampleweight. Because of the weight adjustments (andthe presence of occasional triples), this does notmean merely increasing the final full sampleweights for half the schools by a factor of 1.5and decreasing the weights from the remainingschools by a factor of 0.5. Many replicateweights will also be slightly disturbed, beyondthese adjustments, as a result of repeating thenon-response adjustments separately byreplicate.

Formation of Variance Strata

With the approach described above, all originalsampled schools were sorted in stratum order(including refusals and ineligibles) and paired, bycontrast to other international educationassessments such TIMSS and TIMSS-R that havepaired participating schools only. However, thesestudies did not use an approach reflecting theimpact of non-response adjustments on samplingvariance. This is unlikely to be a big componentof variance in any PISA country, but theprocedure gives a more accurate estimate ofsampling variance.

Countries Where All Students WereSelected for PISA

In Iceland, Liechtenstein and Luxembourg, alleligible students were selected for PISA. It mightbe considered surprising that the PISA datashould reflect any sampling variance in thesecountries, but students have been assigned tovariance strata and variance units, and the BRRformula does give a positive estimate ofsampling variance for three reasons. First, ineach country there was some student non-response, and, in the case of Luxembourg, someschool non-response. Not all eligible studentswere assessed, giving sampling variance. Second,only 55 per cent of the students were assessed inmathematics and science. Third, the issue is tomake inference about educational systems andnot particular groups of individual students, so itis appropriate that a part of the samplingvariance reflect random variation betweenstudent populations, even if they were to be

subjected to identical educational experiences.This is consistent with the approach that isgenerally used whenever survey data are used totry to make direct or indirect inference aboutsome underlying system.

Xt*

PIS

A 2

00

0 t

ech

nic

al r

epor

t

98

Scaling PISA Cognitive Data

Ray Adams

The mixed coefficients multinomial logit modelas described by Adams, Wilson and Wang(1997) was used to scale the PISA data, andimplemented by ConQuest software (Wu, Adamsand Wilson, 1997).

The Mixed CoefficientsMultinomial Logit Model

The model applied to PISA is a generalised formof the Rasch model. The model is a mixedcoefficients model where items are described bya fixed set of unknown parameters ξ, while thestudent outcome levels (the latent variable), θ, isa random effect.

Assume that I items are indexed with each item admitting Ki + 1 responsecategories indexed . Use the vectorvalued random variable,

where

(7)

to indicate the Ki + 1 possible responses to item i.

A vector of zeroes denotes a response incategory zero, making the zero category areference category, which is necessary for modelidentification. Using this as the referencecategory is arbitrary, and does not affect thegenerality of the model. The Xi can also becollected together into the single vector

, called the response vector

(or pattern). Particular instances of each of these

random variables are indicated by their lowercase equivalents; x, xi and xik.

Items are described through a vector, of p parameters. Linear

combinations of these are used in the responseprobability model to describe the empiricalcharacteristics of the response categories of eachitem. Design vectors ,each of length p, which can be collected to form a design matrix

define these linear combinations.The multi-dimensional form of the model

assumes that a set of D traits underlies theindividuals’ responses. The D latent traits definea D-dimensional latent space. The vector

, represents an individual’sposition in the D-dimensional latent space.

The model also introduces a scoring functionthat allows the specification of the score orperformance level assigned to each possibleresponse category to each item. To do so, thenotion of a response score bijd is introduced,which gives the performance level of an observedresponse in category j, item i, dimension d. Thescores across D dimensions can be collected intoa column vector andagain collected into the scoring sub-matrix foritem i, and then into ascoring

matrix for the entire test.

(The score for a response in the zero category iszero, but other responses may also be scoredzero).

B B B B= ( )1 2

T TIT T

, , ,K

B b b bi i i iDT= ( )1 2, , ,K

bik ik ik ikDT

b b b= ( )1 2, , ,K

θθ =( )′θ θ θ1 2, , ,K D

A a a a a a aT

K K IK I= ( )11 12 1 21 21 2

, , , , , , , ,K K K

a ij ii I j K, , , ; ,= =( )1 1K K

ξξT

p= ( )ξ ξ ξ1 2, , ,K

X X X XT T T

IT= ( )1 2, , ,K

Xi j

ij =

1

0

if response to item is in category

otherwise

,

X i i i iK

TX X X

i= ( )1 2, , , ,K

k Ki= 0 1, , ,K

i I=1, ,K

Chapter 9

The probability of a response in category j of item i is modelled as:

(8)

For a response vector

(9)

with , (10)

where Ω is the set of all possible response vectors.

The Population Model

The item response model is a conditional model, in the sense that it describes the process of generatingitem responses conditional on the latent variable, θ. The complete definition of the model, therefore,requires the specification of a density, , for the latent variable, θ. Let α symbolise a set ofparameters that characterise the distribution of θ. The most common practice, when specifying uni-dimensional marginal item response models, is to assume that students have been sampled from a normalpopulation with mean µ and variance σ2. That is:

, (11)

or equivalently, (12)

where .

Adams, Wilson and Wu (1997) discuss how a natural extension of (11) is to replace the mean, µ, withthe regression model, , where is a vector of u, fixed and known values for student n, and β is thecorresponding vector of regression coefficients. For example, Yn could be constituted of student variablessuch as gender or socio-economic status. Then the population model for student n becomes:

, (13)

where it is assumed that the En are independently and identically normally distributed with mean zeroand variance σ2 so that (13) is equivalent to:

, (14)

a normal distribution with mean and variance σ2. If (14) is used as the population model then theparameters to be estimated are β, σ2 and ξ.

The generalisation needs to be taken one step further to apply it to the vector valued θ rather than thescalar valued θ. The extension results in the multivariate population model:

, (15)

where γ is a u×d matrix of regression coefficients, Σ is a d×d variance-covariance matrix and Wn is a u×1vector of fixed variables.

In PISA, the Wn variables are referred to as conditioning variables.

f n nd

n nT

n nθθ−−11θθ γγ ΣΣ ΣΣ θθ γγ ΣΣ θθ γγ; , , exp — —W W W( ) = ( ) − ( ) ( )

− −2

1

22 1

2π

YnTββ

f n n n nT T

n nT

θ θ σ πσσ

θ θ; , , exp — —Y Y Yb 2 2 1 2

221

2( ) = ( ) − ( ) ( )

−ββ ββ

θn nT

nE= +Y ββ

YnYnTββ

E N~ ,0 2σ( )θ µ= + E

f fθθ θθ αα; ; , exp( ) ≡ ( ) = ( ) −−( )

−

θ θ µ σ πσθ µ

σ2 2

12

2

222

fθθ θθ αα;( )

ΨΩ

θθ ξξ θθ ξξ, exp( ) = +( )[ ]

∈

−

∑ z B Az

T

1

f x x B A; | , exp ,ξξ θθ θθ ξξ θθ ξξ( ) = ( ) ′ +( )[ ]Ψ

Pr ; , , |exp

exp

.X A Bb a

b a

ijij ij

ik ikk

K=( ) =+ ′( )

+ ′( )∑1

1

ξξ θθθθ ξξ

θθ ξξ=

i

PIS

A 2

00

0 t

ech

nic

al r

epor

t

100

Combined Model

In (16), the conditional item response model (12) and the population model (15) are combined to obtainthe unconditional, or marginal, item response model:

. (16)

It is important to recognise that under this model, the locations of individuals on the latent variablesare not estimated. The parameters of the model are γ, Σ and ξ.

The procedures used to estimate model parameters are described in Adams, Wilson and Wu (1997),Adams, Wilson and Wang (1997), and Wu, Adams and Wilson (1997).

For each individual it is possible however to specify a posterior distribution for the latent variable,given by:

(17)

hf f

f

f f

f f

n n nn n n n

n n

n n n n

n n n n

n

θθθθ

θθ

θθθθ

θθ ξξ γγ ΣΣξξ θθ θθ γγ ΣΣ

ξξ γγ ΣΣ

ξξ θθ θθ γγ ΣΣ

ξξ θθ θθ γγ ΣΣ

; , , , |; | ; , ,

; , , ,

; | ; , ,

; | ; , , .

W xx W

x W

x W

x W

x

x

x

x

( ) =( ) ( )

( )

=( ) ( )( ) ( )∫

f f f dx xx x; , ; | ; ,ξξ,, γγ ΣΣ ξξ θθ θθ γγ ΣΣ θθθθθθ

( ) = ( ) ( )∫

Scal

ing

PIS

A c

ogn

itiv

e d

ata

101

Application to PISA

In PISA, this model was used in three steps:national calibrations; international scaling; andstudent score generation.

For both the national calibrations and theinternational scaling, the conditional itemresponse model (9) is used in conjunction withthe population model (15), but conditioningvariables are not used. That is, it is assumed thatstudents have been sampled from a multivariatenormal distribution.

The PISA model is five-dimensional, made upof three reading, one science and onemathematics dimension. The design matrix waschosen so that the partial credit model was usedfor items with multiple score categories and thesimple logistic model was fit to thedichotomously scored items.

National Calibrations

National calibrations were performed separatelycountry-by-country using unweighted data. Theresults of these analyses, which were used tomonitor the quality of the data and to makedecisions regarding national item treatment, aregiven in Chapter 13.

The outcomes of the national calibrationswere used to make a decision about how to treateach item in each country. This means that: an

item may be deleted from PISA altogether if ithas poor psychometric characteristics in morethan eight countries (a dodgy item); it may beregarded as not-administered in particularcountries if it has poor psychometriccharacteristics in those countries but functionswell in the vast majority of others; or an itemwith sound characteristics in each country butwhich shows substantial item-by-countryinteractions may be regarded as a different item(for scaling purposes) in each country (or insome subset of countries) that is, the difficultyparameter will be free to vary across countries.

Both the second and third options have thesame impact on comparisons between countries.That is, if an item is identified as behavingdifferently in different countries, choosing eitherthe second or third option will have the sameimpact on inter-country comparisons. The choicebetween them could, however, influence within-country comparisons.

When reviewing the national calibrations,particular attention was paid to the fit of theitems to the scaling model, item discriminationand item-by-country interactions.

Item Response Model Fit (Infit MeanSquare)

For each item parameter, the ConQuest fit meansquare statistic index (Wu, 1997) was used to

provide an indication of the compatibility of themodel and the data. For each student, the modeldescribes the probability of obtaining the differentitem scores. It is therefore possible to compare themodel prediction and what has been observed forone item across students. Accumulating comparisonsacross cases gives us an item-fit statistic.

As the fit statistics compare an observed valuewith a predicted value, the fit is an analysis ofresiduals. In the case of the item infit meansquare, values near one are desirable. An infitmean square greater than one is often associatedwith a low discrimination index, and an infitmean square less than one is often associatedwith a high discrimination index.

Discrimination Coefficients

For each item, the correlation between thestudents’ scores on that item and their aggregatescores on the set for the same domain andbooklet as the item of interest was used as anindex of discrimination. If pij (= xij/ mI) is theproportion of score levels that student i achieved

on item j, and pI , (where the summation

is of the items from the same booklet anddomain as item j) is the sum of the proportionsof the maximum score achieved by student i,then the discrimination is calculated as theproduct-moment correlation between pij and pifor all students. For multiple-choice and short-answer items, this index will be the usual point-biserial index of discrimination.

The point-biserial index of discrimination fora particular category of an item is a comparisonof the aggregate score between students selectingthat category and all other students. If thecategory is the correct answer, the point-biserialindex of discrimination should be higher than0.25. Non-key categories should have a negativepoint-biserial index of discrimination. The point-biserial index of discrimination for a partialcredit item should be ordered, i.e., categoriesscored 0 should be lower than the point-biserialcorrelation of categories scored 1, and so on.

Item-by-Country Interaction

The national scaling provides nationally specificitem parameter estimates. The consistency of itemparameter estimates across countries was of partic-ular interest. If the test measured the same latent

trait per domain in all countries, then items shouldhave the same relative difficulty, or, more precisely,would fall within the interval defined by thestandard error on the item parameter estimate.

National Reports

After national scaling, five reports were returnedto each participating country to assist inreviewing their data with the Consortium:1

• Report 1 presented the results of a basic itemanalysis in tabular form. For each item, thenumber of students, the percentage ofstudents and the point-biserial correlationwere provided for each valid category.

• Report 2 provided, for each item and for eachvalid category, the point-biserial correlationand the student-centred Item Response Theory(IRT) ability average in graphical form.

• Report 3 provided a graphical comparison ofthe Item Infit Mean Square coefficients andthe item discrimination coefficients computedat national and international levels.

• Report 4 provided a graphical comparison ofboth the item difficulty parameter and theitem thresholds,2 computed at national andinternational levels.

• Report 5 listed the items that National ProjectManagers (NPMs) needed to check formistranslation and/or misprinting, referred toas dodgy items.

Report 1: Descriptive Statistics on IndividualItems in Tabular Form

A detailed item-by-item report was provided intabular form showing the basic item analysisstatistics at the national level (see Figure 12 foran example).

The table shows each possible responsecategory for each item. The second columnindicates the score assigned to the differentcategories. For each category, the number andpercentage of students responding is shown,along with the point-biserial correlation and theassociated t statistic. Note that for the item inthe example the correct answer is ‘4’, indicatedby the ‘1’ in the score column; thus the point-biserial for a response of ‘4’ is the item’sdiscrimination index, also shown along the top.

= ∑ pijj

PIS

A 2

00

0 t

ech

nic

al r

epor

t

102

1 In addition, two reports showing results from theStudent and School Questionnaires were also returnedto participants.2 A threshold for an item score is the point on thescale at which the probability of a response at thatscore or higher becomes greater than 50 per cent.

2

1

0

-1

-2

0.6

0.30.0

-0.3-0.6

0 1 2 3 4 5 6 7 8 9 R

0 1 2 3 4 5 6 7 8 9 R

0 17 127 102 1053 0 0 0 2 57 14 (no. of students)

0 1 9 7 77 0 0 0 0 4 1 (%)Average Ability by Category

Item: 1M033Q01

Discrimination: 0.39

Key: 4

PointBiserial byCategory

The report shows two kinds of missing data:row 9 indicates students who omitted the itembut responded validly to at least one subsequentitem; row r shows students who did not reachthis item.

Report 2: Descriptive Statistics on IndividualItems in Graphical Form

Report 2 (see Figure 13) graphs the abilityaverage and the point-biserial correlation bycategory. Average Ability by Category iscalculated by domain and centred for each item.This makes it easy to identify positive andnegative ability categories, so that checks can be

made to ensure that, for multiple-choice items,the key category has the highest average abilityestimate, and for constructed-response items, themean abilities are ordered consistently with thescore levels. The displayed graphs also facilitatethe process of identifying the following anomalies:• a non-key category with a positive point-biserial

or a point-biserial higher than the key category;• a key category with a negative point-biserial;

and• for partial-credit items, average abilities (and

point-biserials) not increasing with the scorepoints.

Scal

ing

PIS

A c

ogn

itiv

e d

ata

103

Item 1------item:1 (M033Q01)Cases for this item 1372 Discrimination is 0.39-------------------------------------------------------------------- Label Score Count % of total Pt Bis t

-------------------------------------------------------------------- 0 0 0.00 NA NA 1 0.00 17 1.28 -0.08 -3.14 2 0.00 127 9.57 -0.19 -7.15 3 0.00 102 7.69 -0.12 -4.57 4 1.00 1053 79.35 0.39 16.34 5 0 0.00 NA NA 6 0 0.00 NA NA 7 0 0.00 NA NA 8 0.00 2 0.15 0.02 0.61 9 0.00 57 4.30 -0.22 -8.47 r 0.00 14 1.06 -0.24 -9.15

====================================================================

Figure 12: Example of Item Statistics Shown in Report 1


Report 3: Comparison of National andInternational Infit Mean Square andDiscrimination Coefficients

The national scaling provided the infit meansquare, the point-biserial correlation, the itemparameter estimate (or difficulty estimate) andthe thresholds for each item in each country.Reports 3 (see Figure 14) and 4 (see Figure 15)compare the value computed for one countrywith those computed for all other countries andwith the value computed at international levelfor each item.

The black crosses present the values of thecoefficients computed from the international

database. Shaded boxes represent the mean plusor minus one standard deviation of thesenational values. Shaded crosses represent thevalues for the national data set of the country towhich the report was returned.

Substantial differences between the nationaland international value on one or both of theseindices show that the item is behaving differentlyin that country. This might reflect amistranslation or another problem specific to thenational version, but if the item wasmisbehaving in all or nearly all countries, itmight reflect a specific problem in the sourceitem and not with the national versions.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

104



Delta (Item Difficulty)

-1.5 0.0 1.5

Item-Category Threshold

-1.5 0.0 1.5 (value)

M033Q01

Threshold No. 1

X

X X

X

-1.51

-1.84

-1.51

-1.84

International

National value Summary over all nationalvalues (mean +/- 1 SD)

Delta Infit Mean Square Discrimination Index

0.70 1.00 1.30 (value) 0.00 0.25 0.50 (value)

M033Q01 X

X

X

X1.13

1.10 0.39

0.39

National value Summary over all nationalvalues (mean +/- 1 SD)

International

Report 4: Comparison of National andInternational Item Difficulty Parameters andThresholds

Report 4 presents the item difficulty parametersand the thresholds, in the same graphic form asReport 3. Substantial differences between thenational value and the international value (i.e.,the national value mean) might be interpretedlike an item-by-country interaction.Nevertheless, appropriate estimates of the item-by-country interaction are provided in Report 5.

Report 5: National Dodgy Item ReportFor each country’s dodgy items, Report 5 listswhere the items were flagged for one or more ofthe following reasons: difficulty is significantlyeasier or harder than average; a non-keycategory has a point-biserial correlation higherthan 0.05 if at least 10 students selected it; thekey category point-biserial correlation is lowerthan 0.25; the categories abilities for partialcredit items are not ordered; and/or the infitmean square is higher than 1.20 or lower than0.8. An example extract is shown in Figure 16.

International Calibration

International item parameters were set byapplying the conditional item response model (9)in conjunction with the multivariate populationmodel (15), without using conditioningvariables, to a sub-sample of students. This sub-sample of students, referred to as theinternational calibration sample, consisted of

Scal

ing

PIS

A c

ogn

itiv

e d

ata

105

Item by Country Interactions Discrimination Fit

Largelow discr.

item

Smallhigh discr.

item

Ability notOrdered

Key Point-Biserial

Correlation isNegative

Non-keyPoint-BiserialCorrelation is

Positive

Harderthan

Expected

Easierthan

Expected

M033Q01

M037Q02T

M124Q01

13 500 students comprising 500 students drawnat random from each of the 27 participatingOECD countries that met the PISA 2000response rate standards (see Chapter 13). Thisexcluded any SE booklet students and wasstratified by linguistic community within thecountry.

The allocation of each PISA item to one of thefive PISA 2000 scales is given in Appendix 1 (forreading), Appendix 2 (for mathematics) andAppendix 3 (for science).

Student Score Generation

As with all item response scaling models, studentproficiencies (or measures) are not observed;they are missing data that must be inferred fromthe observed item responses. There are severalpossible alternative approaches for making thisinference. PISA used two approaches: maximumlikelihood, using Warm’s (1985) WeightedLikelihood Estimator (WLE), and plausiblevalues (PVs). The WLE proficiency makes theactual score that the student attained the mostlikely. PVs are a selection of likely proficienciesfor students that attained each score.

Computing Maximum LikelihoodEstimates in PISA

Six weighted likelihood estimates were providedfor each student, one each for mathematicalliteracy, reading literacy and scientific literacyand one for each of the three reading literacy,


sub-scales. These can be treated as (essentially) unbiased estimates of student abilities, and analysed usingstandard methods.

Weighted maximum likelihood ability estimates (Warm, 1985) are produced by maximising (9) withrespect to , that is, solving the likelihood equations:

(18)

for each case, where are the item parameter estimates obtained from the international calibration andd indicates the latent dimensions. is the test information for student n on dimension d, and isthe first derivative with respect to θ. These equations are solved using a routine based on the Newton-Raphson method.

Plausible Values

Using item parameters anchored at their estimated values from the international calibration, the plausiblevalues are random draws from the marginal posterior of the latent distribution, (15), for each student.For details on the uses of plausible values, see Mislevy (1991) and Mislevy et al. (1992).

In PISA, the random draws from the marginal posterior distribution (17) are taken as follows.

M vector-valued random deviates, , from the multivariate normal distribution, ,

for each case n.3 These vectors are used to approximate the integral in the denominator of (17), using theMonte-Carlo integration:

(19)

At the same time, the values:

(20)

are calculated, yielding the set of pairs , which can be used as an approximation of

the posterior density (17); and the probability that could be drawn from this density is given by:

(21)

At this point, L uniformly distributed random numbers are generated; and for each randomdraw, the vector, , that satisfies the condition:

(22)

is selected as a plausible vector.

q qsns

i

i sns

i

=

−

=∑ ∑< ≤

1

1

1

0 0

η

ϕϕni0

η i i

L =1

qp

pnj

mn

mnm

M=

=∑

1

.

ϕϕnj

ϕϕmnmn

m

Mp, ℑ

=1

p f fmn n mn mn n= ( ) ( )x x W; | ; , ,ξξ ϕϕ ϕϕ γγ ΣΣθθ

f f dM

f mnm

M

x xx x; | , ( ; | ) .ξξ θθ θθ,, γγ ΣΣ θθ ξξ ϕϕθθθθ

( ) ( ) ≈ ≡ ℑ∫ ∑=

1

1

f n nθθ θθ ;; ,, γγ ΣΣW ,( )ϕϕmn m

M =1

JndInd

ξξ

bixni−

bij exp bij sndθnd + ′a ij ξ( )exp biksndθnd + ′a ikξ( )

k =1

Ki

∑j =1

K i

∑ +J nd

2I nd

i∈Ω∑

d∈ D∑ = 0,

θθn

PIS

A 2

00

0 t

ech

nic

al r

epor

t

106

3 The value M should be large. In PISA, M = 2000 was used.

Constructing ConditioningVariables

The PISA conditioning variables are preparedusing procedures based on those used in theUnited States National Assessment ofEducational Progress (Beaton, 1987) and inTIMSS (Macaskill, Adams and Wu, 1998). Thesteps involved in this process are as follows:

• Step 1. Each variable in the StudentQuestionnaire was dummy coded according tothe coding presented in Appendix 8.4

• Step 2. For each country, a principalcomponents analysis of the dummy-codedvariables was performed, and componentscores were produced for each student (asufficient number of components to accountfor 90 per cent of the variance in the originalvariables).

• Step 3. Using item parameters anchored attheir international location and conditioningvariables derived from the national principalcomponents analysis, the item-response modelwas fit to each national data set and thenational population parameters γ and Σ wereestimated.5

• Step 4. Five vectors of plausible values weredrawn using the method described above.

Analysis of Data withPlausible Values

It is very important to recognise that plausiblevalues are not test scores and should not betreated as such. They are random numbersdrawn from the distribution of scores that couldbe reasonably assigned to each individual—thatis, the marginal posterior distribution (17). Assuch, plausible values contain random errorvariance components and are not optimal asscores for individuals. 6 Plausible values as a setare better suited to describing the performanceof the population. This approach, developed byMislevy and Sheehan (1987, 1989) and based onthe imputation theory of Rubin (1987), produces

Scal

ing

PIS

A c

ogn

itiv

e d

ata

107

consistent estimators of population parameters.Plausible values are intermediate values

provided to obtain consistent estimates ofpopulation parameters using standard statisticalanalysis software such as SPSS® and SAS®. Asan alternative, analyses can be completed usingConQuest (Wu, Adams and Wilson, 1997).

The PISA student file contains 30 plausiblevalues, five for each of the five PISA 2000cognitive scales and five for the combinedreading scale. PV1MATH to PV5MATH are fivefor mathematical literacy; PV1SCIE to PV5SCIEfor scientific literacy, and PV1READ toPV5READ for combined reading literacy. Forthe three reading literacy sub-scales, retrievinginformation, interpreting texts and reflection andevaluation, the plausible values variables arePV1READ1 to PV5READ1, PV1READ2 toPV5READ2 and PV1READ3 to PV5READ3,respectively.

If an analysis were to be undertaken with oneof these five cognitive scales or for the combinedreading scale, then it would ideally be undertakenfive times, once with each relevant plausiblevalues variable. The results would be averaged,and then significance tests adjusting for variationbetween the five sets of results computed.

More formally, suppose that is astatistic that depends upon the latent variableand some other observed characteristic of eachstudent. That is: where are the values of the latentvariable and the other observed characteristic forstudent n. Unfortunately θn is not observed.However, the item responses, xn, from which themarginal posterior can beconstructed for each student n, are observed. If

is the joint marginal posteriorfor n=1,...,N then:

(23)

can be computed.The integral in (23) can be computed using

the Monte-Carlo method. If M random vectorsare drawn from

(23) is approximated by:

, (24)

r* X, Y( ) ≈ 1

Mr θm , Y( )

m =1

M

∑

=1

Mrm

m =1

M

∑

hθθ θθ ξξ γγ ΣΣ; , , , |Y X( ) Θ Θ Θ1 2, , ,K M( )

r E r

r h d

* *

; , , , |

X,Y ,Y X,Y

,Y Y X

( ) = ( )[ ]= ( ) ( )∫

θθ

θθ θθ ξξ γγ ΣΣ θθθθθθ

hθθ θθ ξξ γγ ΣΣ; , , , |Y X( )

h yn n nθθ θθ ξξ γγ ΣΣ; , , , | x( )

θn ny,( )θθ, , , , ,..., ,Y( ) = ( )θ θ θ1 1 2 2y y yN N

r θθ,Y( )

4 With the exception of gender and ISEI.5 In addition to the principal components, gender,ISEI and school mean performance were added asconditioning variables.6 Where optimal might be defined, for example, aseither unbiased or minimising the mean squared errorat the student level.

where is the estimate of r computed usingthe m-th set of plausible values.

From (23) we can see that the final estimateof r is the average of the estimates computedusing each plausible value in turn. If Um is thesampling variance for then the samplingvariance of is:

, (25)

where and

.

An α-% confidence interval for is:

where is the s

percentile of the t-distribution with υ degrees of

freedom. ,

and d is the degrees of

freedom that would have applied had θn beenobserved. In PISA, d will vary by country andhave a maximum possible value of 80.

f M B VM M= +( )−1 1

υ =

−+

−( )1

1

12 2f

M

f

dM M

t sυ ( )r t V* ± −( )

υ

α12

12

r*

BM

r rM mm

M

=−

−( )=

∑1

1

2

1

ˆ *

UM

Umm

M* =

=∑1

1

V U M BM= + +( )−* 1 1

r*rm

rm

PIS

A 2

00

0 t

ech

nic

al r

epor

t

108

Coding and MarkerReliability Studies

As explained in the first section of this report,on Test Design (see Chapter 2), a substantialproportion of the PISA 2000 items were open-ended and required marking by trained markers(or coders). It was important therefore that PISAimplemented procedures that maximised thevalidity and consistency (both within andbetween countries) of this marking.

Each country coded items on the basis ofMarking Guides prepared by the Consortium(see Chapter 2) using the marking designdescribed in Chapter 6. Training sessions to traincountries in the use of the Marking Guides wereheld prior to both the field trial and the mainstudy.

This chapter describes three aspects of thecoding and marking reliability studiesundertaken in conjunction with the field trialand the main study. These are the homogeneityanalyses undertaken with the field trial data toassist the test developers in constructing valid,reliable scoring rubrics; the variance componentanalyses undertaken with the main study data toexamine within-country rater reliability; and anInter-country Reliability Study undertaken toexamine the between-country consistency inapplying the Marking Guides.

Examining Within-country Variability inMarking

Norman Verhelst

To obtain an estimate of the between-markervariability within each country, multiple markingwas required for at least some student answers.

Therefore, it was decided that multiple markingswould be collected for all open-ended items inboth the field trial and the main study for amoderate number of students. In the main study,either 48 or 72 students’ booklets were multiplymarked, depending on the country. Therequirement was that the same four expertmarkers per domain (reading, mathematics andscience) should mark all items appearingtogether in a test booklet. A booklet containing,for example, 15 reading items, would give athree-dimensional table for reading (48 or 72students by 15 items by 4 markers), where eachcell contains a single category. For each domainand each booklet, such a table was producedand processed in several analyses, which aredescribed later. These data sets were requiredfrom each participating country.

The field trial problems were quite differentfrom those in the main study. In the field trial,many more items were tried than were used inthe main study. One important purpose of thefield trial was to select a subset of items to beused in the main study. One obvious concernwas to ensure that markers agreed to areasonable degree in their categorisation of theanswers. More subtle problems can arise,however. In the final administration of a test, astudent answer is scored numerically. But in theconstruction phase of an item, more than tworesponse categories may be provided, say A, Band C, and it may not always be clear how theseshould be converted to numerical scores. Thetechnique used to analyse the field trial data canprovide at least a partial answer, and also givean indication of the agreement between markers

Chapter 10

treatment of homogeneity analysis, seeNishisato, 1980; Gifi, 1990; or Greenacre,1984.) Although observations may be coded asdigits, these digits are considered as labels, notnumbers. To have a consistent notational system,it is assumed in the sequel that the responsecategories of item i are labelled . Themain purpose of the analysis is to convert thesequalitative observations into (quantitative) datawhich are in some sense optimal.

The basic loss functionThe first step in the analysis is to define a set of(binary) indicator variables that contain all theinformation of the original observations, definedby:

, (26)

where it is to be understood that can takeonly the values 0 and 1.

The basic principle of homogeneity analysis isto assign a number to each student (thestudent score on item i), and a number toeach observation , called the categoryquantification, such that student scores are insome way the best summary of all categoryquantifications that apply to them. Tounderstand this in more detail, consider thefollowing loss function:

. (27) F g x yi ivm iv im

mv

= −∑∑∑ l l

l

( )2

Oivm

yiml

xiv

givml

O givm vim= ⇔ =l l 1

1, , , ,K lK Li

for each item separately. The technique is calledhomogeneity analysis. It is important to notethat the data set for this analysis is treated as acollection of nominal or qualitative variables.

In the main study, the problem was different.The field trial concluded with a selection of adefinite subset of items and a scoring rule foreach. The main problem in the main study wasto determine how much of the total variance ofthe numerical test scores could be attributed tovariability across markers. The basic data set tobe analysed therefore consisted of a three-dimensional table of numerical scores. Thetechnique used is referred to as variancecomponent analysis.

Some items in the field trial appeared tofunction poorly because of well-identified defects(e.g., poor translation). To get an impression ofthe differences in marker variability betweenfield trial and main study, most field trial dataanalyses were repeated using the main studydata. Some comparisons are reported below.

This chapter uses a consistent notationalsystem, summarised in Figure 17.

Nested data structures are occasionallyreferred to, as every student and every markerbelong to a single country. In such cases, theindices m and v take the subscript c.

Homogeneity Analysis

In the analysis, the basic observation is thecategory into which marker m places theresponse of student v on item i, denoted Oivm.Basic in the approach of homogeneity analysis isto consider observations as qualitative ornominal variables. (For a more mathematical

PIS

A 2

00

0 t

ech

nic

al r

epor

t

110

Symbol Range Meaning

i 1,...,I item

c 1,...,C country

Vc Number of students from country c

v 1,..., Vc student

Mc Number of markers from country c

m 1,...,M= marker

1,...,Li category of item i

Figure 17: Notational System

l

M c∑

The data are represented by indicator variables. If marker m has a good idea of the

potential of student v, and thinks it appropriateto assign him/her to category , then, ideally,one would expect that , yielding a zeroloss for that case. But the same marker can haveused the same category for student , who hasa different potential ( ), and since thequantification is unique, there cannot be azero loss in both cases. Thus, some kind ofcompromise is required, which is made byminimising the loss function Fi.

Four observations have to be made inconnection with this minimisation: • The loss function (27) certainly has no unique

minimum, because adding an arbitraryconstant to all and all leaves Fiunaltered. This means that in some way anorigin of the scale must be chosen. Althoughthis origin is arbitrary, there are sometheoretical advantages in defining it throughthe equality:

. (28)

• If for all v and all m and one choosesthen Fi = 0 (and (28) is fulfilled),

which is certainly a minimum. Such asolution, where all variability in the studentscores and in the category quantifications issuppressed, is called a degenerate solution. Toavoid such degeneracy, and at the same timeto choose a unit of the scale, requires therestriction:

. (29)

Restrictions (28) and (29) jointly guarantee thata unique minimum of Fi exists and correspondsto a non-degenerate solution except in somespecial cases, as discussed below.

• Notice that in the loss function, missingobservations are taken into account in anappropriate way. From definition (26) itfollows that if is missing, = 0 for all

, such that a missing observation nevercontributes to a positive loss.

• A distinct loss function is minimised for eachitem. Although other approaches tohomogeneity analysis are possible, the presentone serves the purpose of item analysis well.The data pertaining to a single item areanalysed separately, requiring no assumptions

on their relationships. A later subsectionshows how to combine these separate analysesto compare markers and countries. Another way to look at homogeneity analysis

is to arrange the basic observations (for a singleitem i) in a table with rows corresponding tostudents and columns corresponding to markers,as in the left panel of Figure 18. The results canbe considered as a transformation of theobservations into numbers, as shown in the rightpanel of the figure. At the minimum of the lossfunction, the quantified observations havethe following interesting properties. The totalvariance can be partitioned into three parts: onepart attributable to the columns (the markers),another part attributable to the rows (thestudents), and a residual variance. At thesolution point it holds that:

(30)

If the markers agree well among themselves,Var(residuals) will be a small proportion of thetotal variance, meaning that the markers arevery homogeneous. The index of homogeneity isdefined therefore as:

, (31)

at the point where Fi attains its minimum. Thesubscript c has been added to indicate that thisindex of homogeneity can only be meaningfullycomputed within a single country, as explainedin the next sub-section.

HVar students

Var students Var residualsic =( )

( ) + ( )

Var students is

Var

Var residuals is

( )( ) =( )

maximised

markers

minimised

0,

yiml

l givmlOivm

112

Vxiv

v

V

∑ =

x yiv im= =l 0 l

xivv

=∑ 0

yimlxiv

yiml

xiv'

′υ

x yiv im= l

l

givml

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

111

1... m ...M 1... m ...M

1 1

v v

V V

Figure 18: Quantification of Categories

L L L L L L

M M M M M M M M

L yiml L→ L Oivm = l L

M M M M M M M M

L L L L L L

.

,

The indices Hic can be compared meaningfullywith each other, and across countries and items,because they are all proportions of varianceattributable to the same source (students),compared to the total variance attributable tostudents and markers. The differences betweenthe Hic-indices, therefore, must be attributed tothe items, and can therefore be used as aninstrument for item analysis. Items with a highHic-index are less susceptible to markervariation, and therefore the scores obtained onthem are more easily generalisable acrossmarkers.

Degenerate and quasi-degenerate solutionsAlthough restriction (29) was introduced toavoid degenerate solutions, it is not alwayssufficient, and the data collection design or somepeculiarities in the collected data can lead toother kinds of degeneration. First, degeneracydue to the design is discussed.

Using the notational conventions explained inthe introduction, the loss function (27) can, inthe case of the data of all countries analysedjointly, be written as:

.(32)

To see the degeneracy clearly, suppose C = 2,V1 = V2 and there are no missing responses. Thevalue Fi = 0 (and thus Hi) can be reached asfollows: , and

. This solution complies with(28) and (29), but it can easily be seen that itdoes nothing else than maximise the variance ofthe x’s between countries and minimise thevariance within countries. Of course, one couldimpose a restriction analogous to (29) for eachcountry, but then (32) becomes C independentsums, and the scores (x-values) are no longercomparable across countries. A meaningfulcomparison across countries requires imposingrestrictions of another kind, described in thenext subsection.

In some cases, this kind of degeneracy mayoccur also in data sets collected in a completedesign, but with extreme patterns of missingobservations. Suppose some student has got acode from only one marker and assume,moreover, that this marker used this code only

once. A degenerate solution may then occurwhere this student is contrasted with all others(collapsed into a single point), much in the sameway as in the example above.

Similar cases may occur when a certain code, say, is used a very few times. Assume this

code is used only once, by marker m. Bychoosing a very extreme value for , asituation may occur where the student givencode by marker m tends to contrast with allthe others, although fully collapsing them maybe avoided (because this student’s score is‘pulled’ towards the others by the codes receivedfrom the other markers). But the general resultwill be one where the solution is dominated bythis very infrequent coding by one marker. Suchcases may be called quasi-degenerate and areexamples of chance capitalisation. They areprone to occur in small samples of students,especially in cases where there are manydifferent categories as in the field trial,especially with items with multiple-answercategories. Cases of quasi-degeneracy give aspuriously high Hi-index, and one should be verycareful not to cherish such an item too much,because it might show very poor performance ina cross-validation.

Quasi-degeneracy is an intuitive notion that isnot rigorously defined, and will continue to be amajor source of concern, although adequaterestrictions on the model parameters usuallyaddress the problem.

To develop good guidelines for selecting agood test from the many items used in the fieldtrial, one should realise that a low homogeneityindex points to items that will introduceconsiderable variability into the test scorebecause of rater variance, and may therefore bestbe excluded from the definitive test. But an itemwith a high index is not necessarily a good item.Quasi-degeneracy will tend to occur in caseswhere one or more response categories are usedvery infrequently. It might therefore be useful todevelop a device that can simultaneously judgehomogeneity and the risk of quasi-degeneracy.

Homogeneity Analysis with Restrictions

Apart from cases of quasi-degeneracy, there isanother reason for imposing restrictions on themodel parameters in homogeneity analysis. Thespecific value of Hi obtained from the

l

yimcl

l

y xim ivc cl l= (all )

x civc= − =1 2 if x civc

= =1 1 if

F g x yi iv m iv im

L

m

M

v

V

c

C

c c c c

i

c

c

c

c

= −∑∑∑∑ l l

l

( )2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

112

minimisation of (27) can only be attained if thequantification issued from the homogeneityanalysis is indeed used in applications, i.e., whenthe score obtained by student v on item i whencategorised by marker m in category is equalto the category quantification . But thismeans that the number of points to be earnedfrom receiving category may differ acrossmarkers. An extra difficulty arises when newmarkers are used in future applications. Thiswould imply that in every application a newhomogeneity analysis has to be completed. Andthis is not very attractive for a project like PISA,where the field trial is meant to determine a(more or less) definite scoring rule, whereby thesame scoring applies across all raters and countries.

Restrictions within countriesAs a first restriction, one might wish for novariation across markers within the same countryso that for each country c the restriction:

(33)

is imposed. But since students and markers arenested within countries, this amounts tominimising the loss function for each country:

, (34)

such that the overall loss function is minimisedautomatically to:

. (35)

To minimise (34), technical restrictions similarto (28) and (29) must hold per country.

As in the case without restrictions,homogeneity indices can be computed in thiscase also (see equation (31)), and will be denoted

. It is easy to understand that for all itemsand all countries the inequality:

(36)

must hold.In contrast to some indices to be discussed

further, -indices are not systematicallyinfluenced by the numbers of students or raters.If students and raters within a country can beconsidered to be a random sample of somepopulations of students and markers, the

computed indices are a consistent (and unbiased)estimate of their population counterparts. Thenumber of students and markers influence onlythe standard error.

The -indices can be used for a doublepurpose:• Comparing with Hic within countries: if

for some item, is much lower than Hic,this may point to systematic differences ininterpretation of the Marking Guides amongmarkers. Suppose, as an example, a binaryitem with categories A and B, and all markersin a country except one agree perfectly amongthemselves in their assignment of students tothese categories; one marker, however,disagrees perfectly with the others in assigningto category A where the others choose B, andvice versa. Since each marker partitions allstudents in the same two subsets, the index ofhomogeneity Hic for this item will be one, butthe category quantifications of the outlyingmarker will be different from those of theother markers. Requiring that the categoryquantifications be the same for all markerswill force some compromise, and the resulting

will be lower than one.• Differences between and Hic can also be

used to compare different countries amongeach other (per item). Differences, especially ifthey are persistent across items, make itpossible to detect countries where the markingprocess reveals systematic disagreementamong the markers.

Restrictions within and across countriesOf course, different scoring rules for differentcountries are not acceptable for the main study.More restrictions are therefore needed toascertain that the same category shouldcorrespond to the same item score (quantification)in each country. This amounts to a furtherrestriction on (33), namely:

, (37)

leading to the loss function:

, (38)

and a corresponding index of homogeneity,denoted as .Hi

**

F g x yi iv m

L

m

M

v

V

c

C

iv ic c

i

c

c

c

c

c

** **( )= −∑∑∑∑ l

l

l2

y y c C Lic i il l K l K= = =, ( , , ; , , )1 1

Hic*

Hic*

Hic*

Hic*

Hic*

Hic*

H Hic ic* ≤

Hic*

F Fi icc

* *= ∑

F g x yic iv m iv

L

m

M

v

V

icc c c

i

c

c

c

c

* *( )= −∑∑∑ l

l

l2

y y m M Lim ic c icl l K l K= = =, ( , , , , , ) 1 1

l

yiml

l

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

113

To provide an impression of the use of theseindices, a summary plot for the open-endeditems in science is displayed in Figure 19. Theline shown in the chart connects the -indicesfor all items (sorted in decreasing order). Aninterval is plotted, for each item, based on the

-indices of 24 countries. For these countries,the -indices were sorted in ascending orderand the endpoints of the intervals correspond tothe 6th and 18th values. This figure may be auseful guide for selecting items for the mainstudy since it allows a selection based on thevalue of , but also shows clear differences inthe variability of the -indices. The first threeitems, for example, have almost equal

-indices, but the second one shows morevariability across countries than the other two,and therefore may be less suitable for the mainstudy. But perhaps the clearest example is thelast one, with the lowest -index and showinglarge variations across countries.

An Additional Criterion forSelecting Items

An ideal situation (from the viewpoint ofstatistical stability of the estimates) occurs wheneach of the Li response categories of an item hasbeen used an equal number of times: for eachmarker separately when one does the analysiswithout restrictions; across markers within acountry when an analysis is done with restrictionswithin a country (restriction (33); or acrossmarkers and countries when restriction (38)applies. If the distribution of the categoriesdeparts strongly from uniformity, cases of quasi-degeneracy may occur: the contrast between asingle category with very small frequency and allthe other categories may tend to dominate thesolution. Very small frequencies are more likelyto occur in small samples than in large samples,hence the greater possibility of being influencedby chance. The most extreme case occurs whenthe whole distribution is concentrated in a single

Hi**

Hi**

Hic*

Hi**

Hic*

Hic*

Hi**

Figure 19: H* and H** for Science Items

114

0.1

0.3

0.5

0.7

0.9

1.1

Inde

x of

Hom

ogen

eity

PIS

A 2

00

0 t

ech

nic

al r

epor

t

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

115

category; a homogeneity analysis thus has nomeaning (and technically cannot be carried outbecause normalisation is not defined).

From a practical point of view, an item with adistribution very close to the extreme case of novariability is of little use in testing, but may havea very acceptable index of homogeneity.Therefore it seems wise to judge the quality ofan item by considering simultaneously thehomogeneity index and the form of distribution.

To measure the departure from a uniformdistribution in the case of nominal variables, theindex to be developed must be invariant underpermutation of the categories. For a binary item,for example, the index for an item with p-valuep must be equal to that of an item with p-value1-p.

Pearson’s well known X2 statistic, computedwith all expected frequencies equal to each other,fulfils this requirement, and can be written informula form as:

, (39)

where, in the middle expression, is thefrequency of category , is the averagefrequency and L is the number of categories,and, in the right-hand expression, and

(=1/L) are observed and average proportion,and n is the sample size. This index, however,changes with changing sample size, andcomparison across items (even with constantsample size) is difficult because the maximumvalue increases with the number of categories. Itcan be shown that:

, (40)

where equality is reached only where L-1categories have frequency zero, i.e., the case ofno variability. The index:

(41)

is proposed as an index of departure fromuniformity. Its minimum is zero (uniformdistribution), its maximum is one (no variability).It is invariant under permutation of thecategories and is independent of the sample size.This means that it does not change when all

frequencies are multiplied by a positive constant.Using proportions instead of frequencies ,(41) can be written as:

. (42)

Table 24 gives some distributions and theirassociated values for L = 2 and L = 3. (Rowfrequencies always sum to 100.)

Table 24: Some Examples of

L = 2 L = 3

Frequency Frequency

50 50 0.00 50 50 0 0.25

60 40 0.04 40 30 30 0.01

70 30 0.16 50 25 25 0.06

75 25 0.25 50 48 2 0.22

80 20 0.36 70 15 15 0.30

90 10 0.64 70 28 2 0.35

95 5 0.81 80 10 10 0.49

99 1 0.96 80 18 2 0.51

As an example, the -indices are plotted inFigure 20 against the -values for the 22 open-ended reading items in Booklet 9 of the fieldtrial, using the Australian data. The binary (b)items are distinguished from the items with morethan two categories (p). One can see that mostof the items are situated in the lower right-handcorner of the figure, combining highhomogeneity with a response distribution thatdoes not deviate too far from a uniformdistribution. But some items have highhomogeneity indices and a very skeweddistribution, which may make them less suitablefor inclusion in the main study.

∆Hic

*

∆∆

∆

∆ =

−−∑L

Lp L1

1 2( )l

l

fl pl

∆ =−

X

n L

2

1( )

X n L2 1≤ −( )

p pl

f l fl

X

f f

fn

p p

p

L L2

2 2

=−

=−∑ ∑( ) ( )l

l

l

l

A Comparison Between Field Trial and Main Study

The main study used a selection of items fromthe field trial, but the wording in the item text ofthe Marking Guides was changed in some cases(mainly due to incorrect translations). Thechanged items were not tested in an independentfield trial, however. If they did really representan improvement, their homogeneity indicesshould rise in comparison with the field trial.

Another reason for repeating the homogeneityanalysis in the main study is the relative numbersof students and markers used for the reliabilitystudy in the field trial and in the main study.Small numbers easily give rise to chancecapitalisation, and therefore repeating thehomogeneity analyses in the main study serves asa cross-validation.

Figure 21 shows a scatter-plot of the -indices of the reading items in the Australianfield trial and main study samples, where itemsare distinguished by the reading subscale towhich they belong: 1 = retrieving information,

2 = interpreting texts and3 = reflection and evaluation.

As the dashed line in the figure indicatesequality, it is immediately clear that the -indices for the great majority of items haveincreased in the main study. Another interestingfeature is that the reflection items systematicallyshow the least homogeneity among markers.

Variance Component Analysis

The general approach to estimating thevariability in the scores due to markers isgeneralisability theory. Introductions to theapproach can be found in Cronbach, Gleser,Nanda and Rajaratnam (1972), and in Brennan(1992). In the present section, a shortintroduction is given of the general theory andthe common estimation methods. Ageneralisability coefficient is then derived, as aspecial correlation coefficient, and itsinterpretation is discussed. Finally some specialPISA-related estimation problems are discussed.

Hic*

Hic*

PIS

A 2

00

0 t

ech

nic

al r

epor

t

116

Figure 20: Homogeneity and Departure from Uniformity

0.3 0.5 0.7 0.9H*

0.0

0.2

0.4

0.6

0.8

Del

tabp

Analysis of two-way tablesTo make the notion of generalisability theory clear,a simple case where a number of students answera number of items, and for each answer they geta numerical score which will be denoted as , thesubscript v referring to the student, and the subscripti referring to the item, is described first. These ob-servations can be arranged in a rectangulartable or matrix, and the main purpose of theanalysis is to explain the variability in the table.

Conceptually, the model used to explain thevariability in the table is the following:

, (43)

where is an unknown constant, is the studenteffect, is the item effect, (to be read asa single symbol, not as a product) is the student-item interaction effect and is the measurementerror. The general approach in generalisabilitytheory is to assume that the students in the sampleare randomly drawn from some population, butalso that the items are randomly drawn from apopulation, usually called a universe. This meansthat the specific value can be considered as a

realisation of a random variable, α for example,and similarly for the item effect: is a realisationof the random variable β. Also the interaction effect

and the measurement error are con-sidered as realisations of random variables. So, themodel says that the observed score is a sum of anunknown constant µ and four random variables.

The model as given in (43), however is notsufficient to work with, for several reasons.First, since each student in the sample gives onlya single response to each item in the test, theinteraction effects and the measurement error areconfounded. This means that there is nopossibility to disentangle interaction andmeasurement error. Therefore, they will be takentogether as a single random variable, ε forexample, which is called the residual (and whichis definitely not the same as the measurementerror). This is defined as:

, (44)

and (43) can be rewritten as:

. (45)Yvi v i vi= + + +µ α β ε

ε αβ εvi vi vi= +( ) *

εvi*( )αβ vi

βi

α v

εvi*

( )αβ viβi

α vµ

Yvi v i vi vi= + + + +µ α β αβ ε( ) *

V I×

Yvi

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

117

Figure 21: Comparison of Homogeneity Indices for Reading Items in the Main Study and the Field Trial in Australia

3

31

2 1

0.2 0.4 0.6 0.8 1.0

0.4

0.5

0.6

0.7

0.8

0.9

1.0

3

3

2

2

2

3

3

3

3

3

2

2

3

3

1

3

3

3

11

2

2

3

2

3

1

3

3

3

2

2

3

1

3

1

23

1

1

1

2

22

1

2

1

1

21

1

Mai

n St

udy

Field Trial

Second, since the right-hand side of (45) is asum of four terms, and only this sum isobserved, the terms themselves are not identified,and therefore three identification restrictionshave to be imposed. Suitable restrictions are:

. (46)

Third, apart from the preceding restriction,which is of a technical nature, there is oneimportant theoretical assumption: all the non-observed random variables (student effects, itemeffects and residuals) are mutually independent.This assumption leads directly to a rather simplevariance decomposition:

. (47)

The total variance can easily be estimatedfrom the observed data, as well as the constantµ. The first purpose of variance componentanalysis is to obtain an estimate of the threevariance components , and . If the datamatrix is complete, good estimators are given bythe traditional techniques of variance analysis,using the decomposition of the total sum ofsquares SStot as:

, (48)

where row refers to the students and col refersto the items. Dividing each SS by their respectivenumber of degrees of freedom yields thecorresponding so-called mean squares, fromwhich unbiased estimates of the three unknownvariance components can be derived:

, (49)

, (50)

and

. (51)

Usually the exact value of the three variancecomponents will be of little use, but their relativecontribution to the total variance is. Thereforethe variance components will be expressed as apercentage of the total variance (the sum of thecomponents) in what follows.

σ β2 =

−MS MS

Vcol res

σα2 =

−MS MS

Irow res

σε2 = MSres

SS SS SS SStot row col res= + +

σε2σ β

2σα2

σ Y2

σ σ σ σα β εY2 2 2 2= + +

E E E( ) ( ) ( )α β ε= = = 0

PIS

A 2

00

0 t

ech

nic

al r

epor

t

118

The estimators given by (49) through (51)have the attractive property that they areunbiased, but they also have an unattractiveproperty: the results of the formulae in (50) and(51) can be negative. In practice, this seldomoccurs, and if it does, it is common to changethe negative estimates to zero.

Analysis of three-way tablesThe approach for three-way tables is astraightforward generalisation of the case oftwo-way tables. The observed data are nowrepresented by , the score student v gets forhis/her answer on item i when marked bymarker m. The observed data are arranged in athree-dimensional array (a box), where thestudent dimension will be denoted as rows, theitem dimensions as columns and the markerdimensions as layers.

The model is a generalisation of model (45):

(52)

The observed variable is the sum of aconstant, three main effects, three first-orderinteractions and a residual. The residual in thiscase is the sum of the second-order interaction

and the measurement error . Botheffects are confounded because there is only oneobservation in each cell of the three-dimensionaldata array. The same restrictions as in the caseof a two-way table apply: zero mean of the effectsand mutual independence. Therefore the totalvariance decomposes into seven components:

(53)

and each component can be estimated withtechniques similar to those demonstrated in thecase of a two-way table. (The formulae are notdisplayed.)

The risk of ending up with negative estimatesis usually greater for a three-dimensional tablethan for a two-way table. This will be illustratedby one case. The main effect reflects therelative leniency of marker m: marker m isrelatively mild if the effect is positive; relativelystrict if it is negative. It is not too unrealistic toassume that markers differ systematically inmildness, meaning that the variance component

γm

σ σ σ σ

σ σ σ σα β γ

αβ αγ βγ ε

Y2 2 2 2

2 2 2 2

= + +

+ + + +

εvim*( )αβγ vim

Yvim

Yvim v i m

vi vm im vim

= + + ++ + + +

µ α β γαβ αγ βγ ε( ) ( ) ( )

Yvim

.

,

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

119

will differ markedly from zero, andconsequently that its estimator will have a verysmall probability of yielding a negative estimate.A positive interaction effect means thatmarker m is especially mild for student v (morethan on average towards the other students),reflecting in some way a positive or negative biasto some students. But if the marking procedurehas been seriously implemented—students notbeing known to the markers—it is to beexpected that these effects will be very small ifthey do exist at all. And this means that thecorresponding variance component will be closeto zero, making a negative estimate quite likely.

CorrelationsIf one wants to determine the reliability of a test,one of the standard procedures is to administer thetest a second time (under identical circumstances),and compute the correlation between the two setsof test scores. This correlation is the reliability ofthe test (by definition). But if all variance com-ponents are known, this correlation can be com-puted from them. This is illustrated here for thecase of a two-way table. To make derivations easy,the relative test scores are used, defined as:

. (54)

Using model (43) and substituting in (54):

. (55)

If the test is administered a second time usingthe same students and the same items, the relativescores on the repeated test will of course havethe same structure as the right-hand side of (55), and, moreover, all terms will be identicalwith the exception of the last one, because thetest replication is assumed to be independent.

To compute the covariance between the two setsof test scores, observe that the mean item effectsin both cases are the same for all students, meaningthat this average item effect does not contributeto the covariance or to the variance. And becausethe measurement error is independent in the twoadministrations, it does not contribute to thecovariance. So the covariance between the twoseries of test scores is:

. (56)

The variance of the test scores is equal in thetwo administrations:

. (57)

Dividing the right-hand sides of (56) and (57)gives the correlation:

(58)

But this correlation cannot be computed froma two-way table, because the interactioncomponent cannot be estimated. It iscommon to drop this term from the numeratorsin (58), giving as a result Cronbach’s alphacoefficient (Sirotnik, 1970):

(59)

where the equality holds if and only if theinteraction component is zero.

Another interesting question concerns thecorrelation one would find if on the two testoccasions an independent set of I items (bothrandomly drawn from the same universe) werepresented to the same students. In this case, themean item effect will contribute to variance ofthe test scores, but not to the covariance, andthe correlation will be given by:

, (60)

which of course cannot be computed eitherbecause interaction and measurement error areconfounded. Dropping the interactioncomponent from the numerators gives:

. (61)ρσ

σσ σ

α

αβ ε

2

2

22 2( , ). .

'Y Y

I

v v ≥

++

ρσ

σ

σσ σ σ

ααβ

αβ αβ ε

2

22

22 2 2( , ). .

'

*

Y Y I

I

v v =+

++ +

σαβ2

alpha =

++

=+

≤

σ

σσ σ

σ

σσ

ρ

α

ααβ ε

α

αε

2

22 2

2

22

1

*

. .'( , )

I

I

Y Yv v

σαβ2

ρ1 (Yv. ,Yv.' ) =

σα2 +

σαβ2

I

σα2 +

σαβ2 + σε*

2

I

Var Y Var YIv v( ) ( ). .

' *= = ++

σσ σ

ααβ ε22 2

Cov Y YIv v( , ). .

' = +σσ

ααβ22

YI I Iv v i

ivi

ivi

i.

*( )= + + + +∑ ∑ ∑µ α β αβ ε1 1 1

YI

Yv vii

. = ∑1

Yv.

γ vm

σ γ2

.

.

In generalisability theory, Cronbach’s alpha isalso called a generalisability coefficient forrelative decisions, while the right-hand side of (61)is called the generalisability coefficient forabsolute decisions.

Now the more interesting question of con-structing a sensible correlation coefficient for theproblem in the PISA study, where there are differentsources of variation—three main effects, threefirst-order interactions and a residual effect—isaddressed. The basic problem is to determine theeffects associated with the markers, or in otherwords what would be the correlation betweentwo series of test scores, observed with the samestudents and the same items but at each replicationmarked by an independent set of M markers,randomly drawn from the universe of markers.The relative test score is now defined as:

. (62)

Of the eight terms on the right-hand side of(52), some are to be treated as constant, somecontribute to the score variance and some to thecovariance and the variance, as displayed inTable 25.

Table 25: Contribution of Model Terms to(Co)Variance

constant

variance and covariance

variance

Using this table, and taking the definition ofthe relative score (62) into account, it is not toodifficult to derive the correlation between thetwo series of test scores:

. (63)

The expression on the right-hand side can beused as a generic formula for computing thecorrelation with an arbitrary number of items,and an arbitrary number of markers.

Estimation with missing data and incomplete designs

The estimation of the variance components withcomplete data arrays is an easy task to carry out.With incomplete designs and missing data,however, the estimation procedure becomesrather complicated, and good software to carryout the estimations is scarce. The only commercialpackage known to the author that can handlesuch problems is BMDP (1992), but the numberof cases it can handle is very limited, and it wasfound to be inadequate for the PISA analyses.

The next section proceeds in three steps. First, theprecise structure of the data that were available foranalysis is discussed; second, parameter estimationin cases where some observations are missing atrandom is addressed; and finally, the estimationprocedure in incomplete designs is described.

The structure of the dataIn the data collection of the main study, itemswere included in three domains of performance:mathematics, science and reading. For reportingpurposes, however, it was decided that three sub-domains in reading would be distinguished:retrieving information, interpreting texts andreflection and evaluation. In the present sectionthese three sub-domains are treated separately,yielding five domains in total.

Data were collected in an incomplete design,using nine different test booklets. One bookletcontained only reading items, while the othershad a mixed content (see Chapter 2).Themultiple marking scheme was designed to havefour markers per booklet−domain combination.For example, four markers did the multiplemarking on all open-ended reading items inBooklet 1, and four markers completed the samefor Booklet 2, but there were no restrictions onthe set of markers for different booklets. Therefore,it could happen that in some countries the set ofreading markers was the same for all booklets,while in other countries different sets, with someor no overlapping, were used for the marking inthe three domains. During the marking processno distinction was made as to the three sub-domains of reading: a marker processed all open-ended reading items appearing in a booklet.1

ρ

σσ

σσ σ σ σ σ

ααβ

ααβ γ αγ βγ ε

3

22

22 2 2 2 2

( , ).. ..'Y Y

I

I M I M

v v =

+

+ ++

++

×

γ αγ βγ ε, ( ), ( ),

α αβ, ( )

µ β,

YI M

Yv vimmi

.. =× ∑∑1

PIS

A 2

00

0 t

ech

nic

al r

epor

t

120

1 See Chapter 6 for a full description of the design.

This marking scheme is not optimal for avariance component analysis, but on top of thatthere were two other complications:• If the marking instructions were followed, we

would have a three-dimensional array perbooklet−domain combination, completelyfilled with scores, and such an array was thefirst unit of analysis in estimating the variancecomponents. In many instances, however,scores were not available in some cells ofthese arrays, causing serious estimationproblems. The way these problems weretackled is explained in the discussion of thesecond step: estimation with missingobservations.

• In some countries, the instruction to use fourmarkers per booklet, thus guaranteeing acomplete data array per booklet−domaincombination, was not followed rigorously. Insome instances more markers were used, eachof them doing part of the work in a ratheruncontrolled way. This caused seriousproblems in the estimation procedure of thevariance components, as explained in thediscussion of the third step: estimation inincomplete designs.

Estimation with missing observationsTo demonstrate the problems in the estimationof variance components, the analysis of a two-way table is used as an example, starting withthe definition of an indicator variable :

, (64)

and defining the number of observations perstudent v and per item i as:

(65)

, (66)

and the total number of observations as:

. (67)

For averages, the dot notation will be used:

, (68)

(69)

and

(70)

All averages are weighted averages, and sincethe weights at the single observation level areeither one or zero, the value of is arbitrary if

is zero.Of course the following can always be

written:

(71)

and for the left-hand expression and each of theexpressions between parentheses on the right-hand side of (71), one can define the weightedsum of squares as:

, (72)

, (73)

, (74)

and

. (75)

But a very nice property, which holds in thecase of complete data, is lost in general. It doesnot hold in general that:

.The estimation procedure in common analysis

of variance proceeds by equating the observedsum of squares (or mean squares) to theirexpected value; so the estimation procedure is amoment estimation. Denote with p one of theelements . With complete data,it turns out that:

, (76)E SS A B Rp p p p( ) = + +σ σ σα β ε2 2 2

tot row col res, , ,

SS SS SS SStot row col res= + +

SS u Y Y Y Yres vi vi v iiv

= − − +∑∑ ( ). . ..2

SS n Y Ycol i ii

= −∑ ( ). ..2

SS t Y Yrow v vv

= −∑ ( ). ..2

SS u Y Ytot vi viiv

= −∑∑ ( )..2

Y Y Y Y Y Y

Y Y Y Yvi v i

vi v i

− = − + −+ − − +

.. . .. . ..

. . ..

( ) ( )

( )

uvi

Yvi

YN

u Y

Nt Y

Nn Y

vi viiv

v vv

i ii

..

.

.

=

=

=

∑∑

∑

∑

1

1

1

Yn

u Yii

vi viv

. = ∑1

Yt

u Yvv

vi vii

. = ∑1

N t nv iiv

= =∑∑

n ui viv

= ∑

t uv vii

= ∑

uY

vivi=

1 if is observed

0 otherwise

uvi

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

121

.

,

where the coefficients are verysimple functions of the number of students andthe number of items. The expected valueoperator also has a straightforward meaning,since the sampled students and items areconsidered as a random sample from theirrespective universes.

With missing data, the situation is morecomplicated. The indicator variables are to beconsidered as random variables, and in order todefine expected values of sums of squares, difficultproblems have to be solved, as can easily be seenfrom equation (72), where products of u and Y-variables appear, so that assumptions about thejoint distribution of both kinds of variables haveto be made. But the assumptions in classicalvariance component theory about the Y-variablesare very weak. In fact, the only assumption is thatrow effects, column effects and residuals have afinite variance. Without adding very specificassumptions about the Y-variables, the onlyworkable assumption about the joint distributionof Y and u is that they are independent, or that themissing observations are missing completely atrandom. All analyses carried out on the multiple-

marking data of the main study of PISA make thisassumption. The consequences of the violation ofthese assumptions are outlined below.

With this assumption, the expected valueoperator in (76) is well defined and the generalform of (76) is also valid with missing data,although the coefficients aremore complicated than in the complete case. Thecase of a two-way analysis is given in Table 26.

In the complete case, it holds thatsuch that the coefficients reduce

to a much simpler form, as displayed in Table 27.In Table 27, it is easily checked that the sum of

the last three rows yields the coefficients for. This means that any three of the four

observed sums of squares can be used to solve thethree unknown variance components, and eachsystem will lead to the same solution. With missingdata, however, this feature is lost, and each systemof three equations using the coefficients in Table 26will lead to a different solution, showing thatmoment estimators are not unique. If the modelassumptions are fulfilled, however, the differencesbetween the solutions are reasonably small. (SeeVerhelst (2000) for more details.)

E SStot( )

t I n Vv i= = and

A B Rp p p, and

uvi

A B Rp p p, and

PIS

A 2

00

0 t

ech

nic

al r

epor

t

122

Table 26: Coefficients of the Three Variance Components (General Case)

Student Component Item Component Measurement Error Component

( ) ( ) ( )

N V Iu

t nvi

v iiv

− − − + ∑∑1 2Vn

N

ii−∑ 2

It

N

vv−∑ 2

E SSres( )

I −1Nn

N

ii−∑ 2

It

N

vv−∑ 2

E SScol( )

V −1Vn

N

ii−∑ 2

Nt

N

vv−∑ 2

E SSrow( )

N −1Nn

N

ii−∑ 2

Nt

N

vv−∑ 2

E SStot( )

σε2σ β

2σα2

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

123

For three-way tables, a similar reasoning canbe followed, yielding coefficients for the expectedsums of squares for the total, the three maineffects, the three first-order interactions and theresidual; and any subset of seven equations canbe used for a moment estimator of the sevenvariance components. The expressions for thecomponents, however, are all more complicatedthan in the case of two-way tables and are notpresented here. For details, see Longford (1995)and Verhelst (2000).

An extensive simulation study (Verhelst, 2000)has shown that the variance components can beconsistently estimated if indeed the missing valuesare missing completely at random, but if they arenot, the estimation method may produce totallyunacceptable estimates. As an example, themarking procedure for the reading items inBooklet 1 in the United States was examined.This booklet contains 22 open-ended readingitems, and multiple markings were collected for48 students. Following the marking instructions,four markers had to mark all 22 answers for all48 students, yielding 48 × 22 = 1 056 markingsper marker. In the data set, however, ninedifferent marker identifications were found, withthe number of markings per marker rangingfrom 129 to 618. The design with which themarkings were collected is unknown, and canonly be partially reconstructed from the data. InTable 28, the variance component estimates forthe five ‘retrieving information’ open-endeditems in Booklet 1 arising from the abovemethod analysis are reported. The large negativeestimate for the student−marker interactioncomponent is totally unacceptable, and the useof the estimation method in a case such as this isdubious.

Table 28: Estimates of Variance Components (× 100) in the United States

Student Component ( ) 9.8

Item Component ( ) 5.1

Marker Component ( ) 0.4

Student-Item Interaction Component ( ) 10.0

Student-Marker Interaction Component ( ) -12.0

Item-Marker Interaction Component ( ) -0.3

Measurement Error Component ( ) 13.7

The third step in the procedure is the combi-nation of the variance component estimates for asub-domain. The analysis per booklet yields esti-mates per booklet, but the booklet is not interestingas a reporting unit; and so the estimates must becombined in some way across booklets. In thecase of a two-way table, the estimation equationsare (see equation (76)):

, (77)

where b indexes the booklet and p is any elementfrom . If it can be assumed thatstudents, items and markers are randomly assignedto booklets, the summation of (77) across bookletswill yield estimation equations for the variancecomponents which yield consistent estimates.Therefore, the following equations can be used:

. (78)

For the three-way tables used in the presentstudy, similar estimation equations were used,with the right-hand side of (78) having seven

SS A B Rbpb

bpb

bpb

bpb

∑ ∑ ∑ ∑= + +σ σ σα β ε2 2 2

tot row col res, , ,

SS A B Rbp bp bp bp= + +σ σ σα β ε2 2 2

σε2

σ βγ2

σαγ2

σαβ2

σ γ2

σ β2

σα2

Table 27: Coefficients of the Three Variance Components (Complete Case)

Student Component Item Component Measurement Error Component

( ) ( ) ( )

0

0

0 0 N V I− − +1E SSres( )

I −1N V−E SScol( )

V −1N I−E SSrow( )

N −1N V−N I−E SStot( )

σε2σ β

2σα2

unknown variance components instead of three. Thevariance components for the mathematics domainare displayed in Table 29 for the Netherlands.

A number of comments are required withrespect to this table.• The table is exemplary for all analyses in the

five domains for countries where the markinginstructions were rigorously followed. A morecomplete set of results is given in Chapter 14.

• The most comforting finding is that allvariance components where markers areinvolved (the three columns containing thesymbol as a subscript of the variancecomponent) are negligibly small, meaning thatthere are no systematic marker effects.

• The results of Booklet 3 are slightly outlying,yielding a much smaller student effect and amuch larger item effect than the other booklets.This could be attributed to sampling error (thestandard errors of the variance componentestimates for main effects are notoriouslyhigh; Searle, 1971) but it could also point to asystematic error in the composition of thebooklets. Since it concerns three items only,this effect has not been investigated further.

• The most important variance component is theinteraction between students and items. If themarker effects all vanish, the remaining com-ponents show that the scores are not additivein student effect and item effect. At first sight,this is contradictory to the use of the Raschmodel as the measurement model, but a closerlook reveals an essential difference between theRasch model and the present approach. In IRTmodels, a transformation to an unbounded

γ

PIS

A 2

00

0 t

ech

nic

al r

epor

t

124

Table 29: Variance Components (%) for Mathematics in the Netherlands

Student- Student- Item- Measure-Item Marker Marker ment

Number Student Item Marker Interaction Interaction Interaction Errorof Component Component Component Component Component ComponentComponent

Booklet Items ( ) ( ) ( ) ( ) ( ) ( ) ( )

1 7 37.0 15.6 0.0 41.9 0.0 0.0 5.5

3 3 6.9 43.2 0.1 40.5 0.6 0.0 8.8

5 7 19.3 34.3 0.2 38.3 -0.2 0.2 8.0

8 6 19.3 27.8 0.1 41.8 -0.1 0.4 10.7

9 4 19.3 27.8 0.1 47.8 -0.1 0.1 5.4

All 27 23.6 26.4 0.1 42.1 0.1 0.2 7.5

σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

latent scale is used, whereas in the presentapproach the scores themselves are modelled,and since the scores are bounded (and highlydiscrete), differences in difficulty of the itemswill automatically lead to interaction effects.

• Using the bottom row of Table 29, the general-isability coefficient (63) can be computedfor different values of I and M. These estimatesare displayed, using data from the Netherlands,in Table 30. Extensive tables for all participatingcountries are given in Chapter 14. Here also,the result is exemplary for all countries, althoughthe indices for reading are generally slightlylower than for mathematics. But the main con-clusion is that the correlations are quite high,even with one marker, such that the decisionto use a single marker for the open-ended itemsin the main study seems justified post hoc.

Table 30: Estimates of for the MathematicsScale in the Netherlands

I = 8 I = 16 I = 24M = 1 M = 2 M = 1 M = 2 M = 1 M = 2

0.963 0.981 0.977 0.988 0.982 0.997

Inter-country RaterReliability Study-Design

Aletta Grisay

The PISA 2000 quality control procedures includeda limited Inter-Country Rater Reliability (ICR)study, conducted in October-November 2000, to

ρ3

ρ3

Cod

ing

and

mar

ker

reli

abil

ity

stu

die

s

125

investigate the possibility of systematic bias inmarking open-ended reading items. The within-country multiple-marking exercise explored thereliability of the coding completed by the nationalmarkers in each country. The objective of the ICRstudy was to check on whether, globally, or forparticular items, marks given by the differentnational staffs could be considered equivalent.

A subset of 48 Booklet 7s that is, the sampleincluded in the multiple-marking exercise forreading within each country was used for theICR study. Booklet 7 was chosen because itcontained only reading material, and therefore alarge number of reading items (26) requiringdouble marking. All participating countries wererequested to submit these 48 student booklets tothe Consortium, after obliterating any scoregiven by the national markers.

Staff specially trained by the Consortium forthe study and proficient in the various PISAlanguages then marked the booklets. Their markswere compared to the four marks given by thenational markers to the same students’ answers.

All cases of clear discrepancy (between the marksgiven by the independent marker and by all or mostof the national markers) were submitted for adjud-ication to Consortium researchers involved indeveloping the reading material, along with theactual student’s answer (translated for countriesusing languages other than English or French).

Recruitment of InternationalMarkers

The international scoring of booklets from theseven English-speaking PISA countries (withEngland and Scotland considered as separatecountries for this exercise) was done by threeexperienced Australian markers and oneCanadian marker, all selected by the Consortium.The 48 booklets from Australian students weresent to Canada, to be scored by the trainer forPISA reading in English-speaking Canada. Theremaining six English-speaking groups (fromEnglish-speaking Canada, Ireland, New Zealand,Scotland, the United Kingdom and the UnitedStates) were re-scored at ACER by a group ofthree verifiers selected from the Australian PISAadministration. One of these was the trainer forPISA reading in Australia. The other two haddemonstrated a high level of reliability during theAustralian national operational marking. Each

one had sole responsibility for scoring bookletsfrom two countries.

To cover instruction languages used in all otherPISA countries, the Consortium appointed 15 ofthe 25 translators who had already served in theverification of the national translations of PISAmaterial, and were therefore very familiar withit. The selection criteria retained were (i) perfectcommand of the language(s) of the country (orcountries) whose material had to be scored (asfar as possible, the responsible persons werechosen from among those verifiers who masteredmore than one PISA language, so that each couldmark for two or more countries); (ii) perfectcommand of English and/or French, so that eachcould score on the basis of either source versionof the reading Marking Guide (while checking,when needed, the national version); and (iii) asfar as possible, previous experience teachingeither the national language or foreign languagesat the secondary school level.

Inter-country Rater Reliability StudyTraining Sessions

A three-day training session was conducted inBrussels to train the 15 verifiers to use the PISAreading Marking Guide. Consortium experts ledthe training session. For the three Australian veri-fiers who had worked on the Australian markingrecently (in September-October) and were veryfamiliar with the items, a shorter refreshersession was conducted in Melbourne, Australia.No special training or refresher session wasdeemed necessary for the Canadian verifier.

The session materials included the sourceEnglish (and French) versions of readingmarking instructions for Booklet 7 items (and acopy of the national marking guide for eachlanguage other than English, when available),answers to marking queries on Booklet 7received at ACER, the Reading Workshopmaterial (see Chapter 2), and copies of abenchmark set of English student scripts to bescored independently by all participants to verifytheir own marking consistency.

This English set of benchmark responses wasprepared at ACER as a specific training aid.Most were selected from material sent for thepurpose by Ireland, Scotland, the United Kingdomand the United States. Australian booklets andthe query records provided other sources. For

each item, about 15 responses were chosen topresent typical and borderline cases and to illustrateall likely response categories. A score was provi-ded for each response, accompanied in mostcases by an explanation or comment. Scores andexplanations were in hidden text format.

During the session, the verifiers worked throughthe material, item by item. Scoring instructions foreach item were presented, then the workshopexamples were marked and discussed, then theverifiers worked independently on the relatedbenchmark scripts and checked their scores againstthe recommended scores. Problem responses werediscussed with Consortium staff conducting thetraining session before proceeding to the next item.

At the Brussels session, the verifiers also hadthe opportunity to start marking part of thematerial for one of the countries for which theywere responsible, under the supervision of theConsortium staff. They were instructed on howto enter their marks in the ICR study softwareprepared at ACER to help compare their markswith those given by the national markers. Mostfinished marking at least one of the Booklet 7clusters for one of their countries. They then re-ceived an Excel spreadsheet with the cases wheretheir scores differed from those given by thenational markers. They discussed the first fewproblem cases with the Consortium staff, andreceived instructions on how to enter their com-ments and the translation of the student’s answerin a column next to the flagged case, so that theConsortium staff could adjudicate the case.

A total of about 15 hours was needed tofinish scoring the booklets for each country.

Flag Files

For each country, the ICR study software produceda flag file with about 1 248 cases (48 students ×26 items)—numbers varied slightly, since a fewcountries submitted one or two additional booklets,or could not retrieve some of the booklets requested,or submitted booklets with a missing page.

In the file, an asterisk indicated cases where theverifier’s score differed significantly from the fourscores given by the national markers (e.g., 0 v. 1,respectively, or vice versa; or all national markersgiving various partial scores while the verifiergave full or no credit).

Cases with minor discrepancies only (e.g., wherethe verifier agreed with three out of four of the

national markers) were not flagged, nor werecases with national scores too inconsistent to becompared to the verifier’s score (e.g., wherenational marks were 0011 and the verifier’s markwas 1, or where national marks were 0123 and theverifier’s mark was 2). It was considered that thesecases would be identified in the within-countryreliability analysis but would be of less interestfor the ICR study than cases with a more clearorientation towards leniency or harshness. In eachfile, a blank column was left for Consortium staffto adjudicate flagged cases.

Adjudication

In the adjudication stage, Consortium staff checkedall flagged cases to verify if marks were correct,and if both national markers and the verifier hadused wrong codes in these problematic cases.

For the English-speaking countries, the Con-sortium’s adjudicators checked the flagged casesby referring directly to students’ answers in thebooklets. Responses flagged for adjudication weremarked blind by Consortium experts (i.e., theadjudicator saw none of the previous five marks).Subsequently, that mark was checked against theoriginal four national marks and the verifier’smark. If the adjudicator’s mark was differentfrom more than two of the original four nationalmarks, the response was submitted for a furtherblind marking by the other adjudicator.

For the non English-speaking countries, theprocedure was different, since the flagged students’answers usually had to be translated by theverifier before they could be adjudicated. To reducethe costs of the exercise, verifiers were instructed(i) to simply copy but not translate the student’sanswer next to the flagged case for those languagesknown by the adjudicators (i.e., French, German,Italian, Dutch and Spanish), and (ii) to simplyadd a brief comment, without translating theanswer, for cases when they clearly recognisedthat they had entered an incorrect code. All otherflagged cases were translated into English orFrench. Almost all files for the non English-speaking countries were adjudicated twice. Noblind procedure could be used. Each adjudicatorentered comments in the file to explain therationale for their decision, so that a finalconsensus could be reached.

The results of this analysis are given inChapter 14.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

126

Data Cleaning Procedures

Christian Monseur

This chapter presents the data cleaning stepsimplemented during the main survey of PISA 2000.

Data cleaning at theNational Centre

National Project Managers (NPMs) were requiredto submit their national data in KeyQuest® 2000,the generic data entry package developed by Con-sortium staff and pre-configured to include the dataentry forms, referred to later as instruments: theachievement test booklets one to nine (togethermaking up the cognitive data); the SE booklet (alsocognitive, known as booklet 0); multiple-markingsheets; School Questionnaire and StudentQuestionnaire instruments with and without thetwo international options (Computer Familiarityor Information Tech-nology (IT) questionnaireand Cross-Curriculum Competency (CCC)questionnaire); the List of Schools; and theStudent Tracking Forms.

The data were verified at several points(integrity check) from the time of data entry.Validation rules (or range checks) were specifiedfor each variable defined in KeyQuest®, and avariable datum was only accepted if it satisfiedpre-specified validation rules.1 To prevent dupli-cate records, a set of variables assigned to aninstrument were identified as primary keys. Forthe student test booklets, stratum, school andstudent identifications were the primary keys.

Countries were requested to enter data into theStudent Tracking Form module before starting toenter data for cognitive tests or context question-naires. This module, or instrument, containedcomplete student identification as it should haveappeared on the booklet and questionnaire thatthe student received at the testing session. Whenconfigured, KeyQuest® instruments designed forstudent data were linked with the StudentTracking Form so that warning messagesappeared when data operators tried to enterinvalid student identifiers or student identifiersthat did not match a record in the form.

After the data entry process was completed,NPMs were required to implement somechecking procedures using KeyQuest® beforesubmitting data to the Consortium, and torectify any integrity errors. These includedinconsistencies between: the List of Schools andthe School Questionnaire; the Student TrackingForm and achievement test booklets; the StudentTracking Form and the Student Questionnaire;the achievement test booklets and the StudentQuestionnaire; and, in the multiple-marking,reliability data according to the internationaldesign in order to detect other than fourduplicate students per booklet.

NPMs were also required to submit anAdaptation and Modification Form with theirdata, describing all changes they had made tovariables on the questionnaire, including theaddition or deletion of variables or responsecategories for variables, and changes to themeaning of response categories. NPMs were alsorequired to propose recoding rules where thenational data did not match the international data.

1 National centres could modify the configuration ofthe variables, giving a range of values that wassometimes reduced or extended compared toConsortium expectations.

Chapter 11

Data Cleaning at theInternational Centre

Data Cleaning Organisation

Data cleaning was a major component of thePISA 2000 Quality Control and AssuranceProgram. It was of prime importance that theConsortium detected all anomalies andinconsistencies in submitted data and that noerrors were introduced during the cleaning andanalysis phases. To reach these high qualityrequirements, the Consortium implemented dualindependent processing.

Two data analysts developed the PISA 2000data cleaning procedures independently. At eachstep, the procedures were considered completeonly when their application to a fictitiousdatabase and to the first two PISA databasesreceived from countries produced identicalresults and files.

The data files submitted by national centresoften needed specific data cleaning or recodingprocedures, or at least adaptation of standarddata cleaning procedures. Two analysts thereforeindependently cleaned all submitted data files;the cleaning and analysis procedures were runwith both SAS® and SPSS®.

Three teams of data analysts produced thenational databases. A team leader wasnominated within each team and was the onlyindividual to communicate with the nationalcentres. Each team used SAS® and SPSS®.

Data Cleaning Procedures

Because of the potential impact of PISA resultsand the scrutiny to which the data were likely tobe put, it was essential that no dubious recordsremained in the data files. During cleaning asmany dubious records as possible were identified,and through a process of extensive discussionbetween each national centre and the Consortium’sdata processing centre at the Australian Councilfor Educational Research (ACER), an effort wasmade to correct and resolve all data issues. Whenno adequate solution was found, the offendingdata records were deleted.2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

128

Unresolved inconsistencies in student and schoolidentifications also led to the deletion of recordsin the database. Unsolved systematic errors for aparticular item were replaced by not applicable codes.For instance, if countries reported a mistranslationor misprint in the national version of a cognitivebooklet, data for the variables were recoded asnot applicable and were not used in the analyses.Finally, errors or inconsistencies for particularstudents and particular variables were replacedby not applicable codes.

National Adaptations to the Database

When data arrived at the Consortium, the firststep was to check the consistency of the databasestructure with the international databasestructure. An automated procedure wasdeveloped for this purpose. For each instrumentit reported deleted variables, added variables andvariables for which the validation rules (rangechecks) had been changed. This report was thencompared with the information provided by theNPM in the Adaptation and Modification Form.

Once all inconsistencies were resolved, thesubmitted data were recoded where necessary tofit the international structure. All additional ormodified variables were set aside in a separatefile so that countries could use these data fortheir own purposes but they were not includedin the international database and have not beenused in international data processing.

Verifying the Student Tracking Formand the List of Schools

The Student Tracking Form and the List ofSchools were central instruments, because theycontained the information used in computingweight, exclusion, and participation rates. TheStudent Tracking Form contained all studentidentifiers, exclusion and participation codes, thebooklet number assigned and some demographicdata.3 The List of Schools contained, amongother variables, the PISA population size, thegrade population size and the sample size. Theseforms had to be submitted electronically.

The data quality in these two forms and theirconsistency with the booklets and Student

2 Record deletion was strenuously avoided as itdecreased the participation rate.

3 See Appendix 11 for an example.

Questionnaire data were verified for: • consistency of exclusion codes and

participation codes (for each session) with thedata in the test booklets and questionnaires;

• consistency of the sampling information in theList of Schools (i.e., target population size andthe sample size) with the Student TrackingForm;

• within-school student samples selected inaccordance with required internationalprocedures; and

• consistency of demographic information in theStudent Tracking Form (grade, date of birthand gender) with that in the booklets orquestionnaires.

Verifying the Reliability Data

Some 100 checking procedures wereimplemented to check the following componentsof the multiple-marking design (see Chapter 6):• number of records in the reliability files;• number of records in the reliability files and

the corresponding booklets;• marker identification consistency;• marker design;• selection of the booklets for multiple marking;

and• extreme inconsistencies in the marks given by

different markers (see Chapter 14).4

Verifying the Context QuestionnaireData

The Student and School Questionnaire dataunderwent further checks after recoding fornational adaptations. Invalid or suspiciousstudent and school data were reported to anddiscussed with countries. Four types ofconsistency checks were run:• Non-valid sums: For example, two questions

in the School Questionnaire (SCQ04 andSCQ08) each requested the school principalto provide information as a percentage. Thesum of the values had to be 100.

• Implausible values: Consistency checks acrossvariables within instruments combined withthe information of two or more questions todetect suspicious data, like the number of

Dat

a cl

ean

ing

pro

ced

ure

s

129

students and the number of teachers. Outlyingratios were identified and countries wererequested to check the correctness of thenumbers. These checks included:• Identifying numbers of students (SCQ02)

and numbers of teachers (SCQ14). Ratiosoutside ±2 standard deviations wereconsidered outliers;

• Identifying numbers of computers (SCQ13)and numbers of students (SCQ02). Ratiosoutside ±2 standard deviations wereconsidered outliers;

• Comparing the mother’s completededucation level in terms of the InternationalStandard Classification of Education(ISCED, OECD, 1999b) categories forISCED 3A (STQ12) and ISCED 5A or higher(STQ14). If the mother did not completeISCED 3A, she could not have completed5A; and

• Comparing the father’s completededucation level similarly.

• Outliers: Numerical answers from the SchoolQuestionnaire were standardised and outliervalues (±3 standard deviations) were returned tonational centres for checking.5

• Missing data confusion: Possible confusionsbetween valid codes 9, 99, 999, 90, 900 andmissing values were encountered during thefield trial. Therefore, for all numericalvariables, values close to the missing codeswere returned to countries for verification.

Preparing Files forAnalysis

For each PISA participating country, several fileswere prepared for use in analysis:• a processed cognitive file with all student

responses to all items for the three domainsafter coding;

• a raw cognitive file with all student responses toall items for the three domains before coding;

4 For example, some markers reported a missing valuewhile others reported non-zero scores.

5 The questions checked in this manner were schoolsize (SCQ02), instructional time (SCQ06), number ofcomputers (SCQ13), number of teachers (SCQ14),teacher professional development (SCQ15), number ofhours per week in each of test language, mathematicsand science classes (STQ27), class size (STQ28) andschool marks (STQ41).

• a student file with all data from the StudentQuestionnaire and the two internationaloptions (CCC and IT);

• a school file with data from the SchoolQuestionnaire (one set of responses per school);

• a weighting file with information from theStudent Tracking Form and the List of Schoolsnecessary to compute the weights; and

• reliability files—19 files with recoded studentanswers and 19 files with scores, pertaining totwo domains from each of six booklets, threedomains from each of two booklets and onedomain from the remaining booklet, wereprepared to facilitate the reliability analyses.

Processed Cognitive File

For a number of items in the PISA test booklets,a student score on the item was determined bycombining multiple student responses. Mostrecoding required combining two answers intoone and summarising the information from thecomplex multiple-choice items.

In the PISA material, some of the open-endedmathematics and science items were coded in twodigits while all other items were coded with aone-digit mark. ConQuest, the software used toscale the cognitive data, requires items of thesame length. To minimise the size of the data file,the double-digit items were recoded into one-digitvariables using the first digit. All data producedthrough scoring, combining or recoding have Tadded to their item’s variable label.

For items omitted by students, embeddedmissing and non-reached missing items weredifferentiated. All consecutive missing valuesstarting from the end of each cognitive session(that is, from the end of the first hour and fromthe end of the second hour separately) werereplaced by a non-reached code (r), except forthe first value of the missing series. Embeddedand non-reached missing items were treateddifferently in the scaling.

Non-reached items for students who werereported to have left the session earlier thanexpected or arrived later than expected (partparticipants) were considered not applicable inall analyses.

The Student Questionnaire

The Student Questionnaire file includes the datacollected from the main questionnaire, thecomputer familiarity international option andthe cross-curriculum competency internationaloption. If a country did not participate in theinternational options, not applicable codes wereused for all variables from them.

Six derived variables were computed duringthe data cleaning process:• Father’s occupation, mother’s occupation and

the student’s expected occupation at age 30,which were each originally coded using ISCO,were transformed into the Ganzeboom, deGraaf and Treiman (1992) InternationalSocio-Economic Index.

• Question STQ41 regarding school marks is pro-vided in three formats: nominal, ordinal andnumerical. The nominal option is used if thecountry provided data collected withquestion 41b—that is, above the pass mark, atthe pass mark and below the pass mark. Datacollected through question 41a were codedaccording to the reporting policy in the countries.Some countries submitted data in a range of1-5, or 1-7 etc., while others reported studentmarks on a scale with maximum score of 20,or 100. These data were recoded in categoricalformat if fewer than eight categories wereprovided (1-7) or in percentages if more thanseven categories were provided.The selection of students included in the

questionnaire file was based on the same rules asfor the cognitive file.

The School Questionnaire

No modifications other than the correction ofdata errors and the addition of the country four-digit codes were made to the School Questionnairefile. All participating schools, i.e., any school forwhich at least one PISA-eligible student wasassessed, have a record in the internationaldatabase, regardless of whether or not theyreturned the School Questionnaire.

The Weighting Files

The weighting files contained the information fromthe Student Tracking Form and from the List ofSchools. In addition, the following variables werecomputed and included in the weighting files.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

130

• For each of the three domains and for eachstudent, a scalability variable was computed.If all items for one domain were missing, thenthe student was considered not scalable. Thesethree variables were useful for selecting thesample of students who would contribute tothe international item calibration.

• For each student, a participation indicatorwas computed. A student who participated inthe first hour of the testing session6 (and didnot arrive late) was considered a participant.A student who only attended the secondcognitive session and/or StudentQuestionnaire session was not considered aparticipant. This variable was used tocompute the student participation rate.

• For each student, a scorable variable wascomputed. All students who attended at leastone cognitive session (that is, either or bothhours of the session) were consideredscorable. Further, if a student only attendedthe Student Questionnaire session andprovided data for the father’s or mother’soccupation questions, then the student wasalso considered scorable. Therefore, anindicator was also computed to determinewhether the student answered the father’s ormother’s occupation questions.A few countries submitted data with a grade

national option. Therefore, two eligibilityvariables—PISA-eligible and grade-eligible—were also included in the Student TrackingForm. These new variables were also used toselect the records that were included in theinternational database and therefore in thecognitive file and Student Questionnaire file. Tobe included in the international database, thestudent had to be both PISA-eligible andscorable. In other words, all PISA students whoattended one of the cognitive sessions wereincluded in the international database. Thosewho attended only the questionnaire sessionwere included if they provided an answer for thefather’s or the mother’s occupation questions.

PISA students reported in the StudentTracking Form as not eligible, no longer atschool, excluded for physical, mental, orlinguistic reasons, or absent were not included inthe international database. Students who refused

Dat

a cl

ean

ing

pro

ced

ure

s

131

to participate in the assessment were also notincluded in the international database.

All non-PISA students, i.e., students assessedin a few countries for a national or internationalgrade sample option, were excluded from theinternational database. Countries submittingsuch data to the Consortium received themseparately.

The Reliability Files

One file was created for each domain and testbooklet. The data from the reliability bookletswere merged with those in the test booklets sothat each student selected for the multiple-marking process appears four times in these files.

6 Of either the original session or a make-up session.

Sampling Outcomes

Christian Monseur, Keith Rust and Sheila Krawchuk

This chapter reports on PISA sampling outcomes.Details of the sample design are given in Chapter 4.

Table 31 shows the various quality indicatorsfor population coverage and the various piecesof information used to derive them. Thefollowing notes explain the meaning of eachcoverage index and how the data in each columnof the table were used.

Indices 1, 2 and 3 are intended to measurePISA population coverage. Indices 4 and 5 areintended to be diagnostic in cases where indices1, 2 or 3 have unexpected values. Manyreferences are made in this chapter to the variousSampling Forms on which NPMs documentedstatistics and other information needed inundertaking the sampling. The forms themselvesare included in Appendix 9.

Index 1: Coverage of the National DesiredPopulation, calculated by P/(P+E) × 3[c]/3[a].• The National Desired Population (NDP),

defined by Sampling Form 3 response box [a]and denoted here as 3[a], is the populationthat includes all enrolled 15-year-olds in eachcountry (with the possibility of small levels ofexclusions), based on national statistics.However, the final NDP reflected on eachcountry’s school sampling frame might havehad some school-level exclusions. The valuethat represents the population of enrolled 15-year-olds minus those in excluded schoolsis represented by response box [c] onSampling Form 3 and denoted here as 3[c].Thus, the term 3[c]/3[a] provides theproportion of the NDP covered in eachcountry based on national statistics.

• The value (P+E) provides the weightedestimate from the student sample of alleligible 15-year-olds in each country, where Pis the weighted estimate of eligible non-excluded 15-year-olds and E is the weightedestimate of eligible 15-year-olds that wereexcluded within schools. Therefore, the termP/(P+E) provides an estimate based on thestudent sample of the proportion of theeligible 15-year-old population represented bythe non-excluded eligible 15-year-olds.

• Thus the result of multiplying these twoproportions together (3[c]/3[a] and P/(P+E))indicates the overall proportion of the NDPcovered by the non-excluded portion of thestudent sample.

Index 2: Coverage of the National EnrolledPopulation, calculated by P/(P+E) × 3[c]/2[b].• The National Enrolled Population (NEP),

defined by Sampling Form 2 response box [b]and denoted here as 2[b], is the populationthat includes all enrolled 15-year-olds in eachcountry, based on national statistics. The finalNDP, denoted here as 3[c], reflects the 15-year-old population from each country’sschool sampling frame. This value representsthe population of enrolled 15-year-olds lessthose in excluded schools.

• The value (P+E) provides the weightedestimate from the student sample of alleligible 15-year-olds in each country, where Pis the weighted estimate of eligible non-excluded 15-year-olds and E is the weightedestimate of eligible 15-year-olds that wereexcluded within schools. Therefore, the term

section four: quality indicators and outcomes

Chapter 12

P/(P+E) provides an estimate based on thestudent sample of the proportion of theeligible 15-year-old population that isrepresented by the non-excluded eligible 15-year-olds.

• Multiplying these two proportions together(3[c]/2[b] and P/(P+E)) gives the overallproportion of the NEP that is covered by thenon-excluded portion of the student sample.

Index 3: Coverage of the National 15-Year-OldPopulation, calculated by P/2[a].• The National Population of 15-year-olds,

defined by Sampling Form 2 response box [a]and denoted here as 2[a], is the entirepopulation of 15-year-olds in each country(enrolled and not enrolled), based on nationalstatistics. The value P is the weighted estimateof eligible non-excluded 15-year-olds from thestudent sample. Thus P/2[a] indicates theproportion of the National Population of 15-year-olds covered by the non-excludedportion of the student sample.

Index 4: Coverage of the Estimated SchoolPopulation, calculated by (P+E)/S.• The value (P+E) provides the weighted

estimate from the student sample of alleligible 15-year-olds in each country, where Pis the weighted estimate of eligible non-excluded 15-year-olds and E is the weightedestimate of eligible 15-year-olds who wereexcluded within schools.

• The value S is an estimate of the 15-year-oldschool population in each country. This isbased on the actual or (more often)approximate number of 15-year-olds enrolledin each school in the sample, prior tocontacting the school to conduct theassessment. The S value is calculated as thesum over all sampled schools of the productof each school’s sampling weight and itsnumber of 15-year-olds (ENR) as recorded onthe school sampling frame. In the infrequentcase where the ENR value was not available,the number of 15-year-olds from the StudentTracking Form was used.

• Thus, (P+E)/S is the proportion of theestimated school 15-year-old population that isrepresented by the weighted estimate from thestudent sample of all eligible 15-year-olds. Itspurpose is to check whether the student

sampling has been carried out correctly, and toassess whether the value of S is a reliablemeasure of the number of enrolled 15-year-olds.This is important for interpreting Index 5.

Index 5: Coverage of the School Sampling FramePopulation, calculated by S/3[c].• The value S/3[c] is the ratio of the enrolled

15-year-old population, as estimated fromdata on the school sampling frame, to the sizeof the enrolled student population, asreported on Sampling Form 3. In some cases,this provides a check as to whether the dataon the sampling frame give a reliable estimateof the number of 15-year-olds in each school.In other cases, however, it is evident that 3[c]has been derived using data from the samplingframe by the National Project Manager, sothat this ratio may be close to 1.0 even ifenrolment data on the school sampling frameare poor. Under such circumstances, Index 4will differ noticeably from 1.0, and the figurefor 3[c] will also be inaccurate.Tables 32, 33 and 34 present school and

student-level response rates. Table 32 indicatesthe rates calculated by using only originalschools and no replacement schools. Table 33indicates the improved response rates when firstand second replacement schools were accountedfor in the rates. Table 34 indicates the studentresponse rates among the full set of participatingschools.

For calculating school response rates beforereplacement, the numerator consisted of all originalsample schools with enrolled age-eligible studentswho participated (i.e., assessed a sample of eligiblestudents, and obtained a student response rate ofat least 50 per cent). The denominator consistedof all the schools in the numerator, plus thoseoriginal sample schools with enrolled age-eligiblestudents that either did not participate or failedto assess at least 50 per cent of eligible samplestudents. Schools that were included in thesampling frame, but were found to have no age-eligible students, were omitted from thecalculation of response rates. Replacementschools do not figure in these calculations.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

134

Sam

pli

ng

outc

omes

135Tab

le 3

1:

Sam

plin

g an

d C

over

age

Rat

es

Sam

plin

g Sh

eet

Info

rmat

ion

Sam

ple

Info

rmat

ion

Part

icip

ants

Excl

uded

Inel

igib

leEl

igib

leC

over

age

Indi

ces

All

Enro

lled

Nat

iona

lSc

hool

Nat

iona

l%

Enro

lled

Act

ual

W’td

Act

ual

W’td

Act

ual

W’td

Act

ual

W’td

With

in-

Ove

r-In

eli-

15-

15-

Tar

get

Leve

lT

arge

tSc

hool

-St

uden

tsSc

hool

all

gibl

eye

ar-

year

-Ex

cl

Min

usLe

vel

onEx

cl’n

Excl

’n(%

)1

23

45

olds

olds

’ns

Scho

ol

Excl

Scho

olR

ate

Rat

eEx

cl’n

s’n

sFr

ame

(%)

(%)

SF 2

[a]

SF 2

[b]

SF 3

[a]

SF 3

[b]

SF 3

[c]

3[b]

/3[a

] S

PE

I &

WP+

EE/

(P+E

)

Aus

tral

ia

266

878

248

908

248

738

2 85

024

5 88

81.

1524

4 15

75

176

229

152

632

688

233

6 79

55

239

231

840

1.16

2.29

2.93

0.98

0.98

0.86

0.95

0.99

Aus

tria

95

041

90 3

5490

354

3290

322

0.04

86 6

014

745

71 5

4741

500

259

3 38

94

786

72 0

470.

690.

734.

700.

990.

99•7

•70.

96

Bel

gium

112

1 12

111

9 05

511

8 97

21

091

117

881

0.92

117

836

6 67

011

0 09

510

01

596

3657

26

770

111

691

1.43

2.33

0.51

0.98

0.98

0.91

0.95

1.00

Bel

gium

(Fl.)

71 0

7468

995

68 9

1289

768

015

1.30

69 1

103

890

61 1

4325

428

1112

83

915

61 5

710.

701.

990.

210.

980.

980.

860.

891.

02

Bel

gium

(Fr.)

49 2

8949

289

49 2

8919

449

095

0.39

48 7

262

780

48 9

5275

1 16

825

443

2 85

550

120

2.33

2.72

0.88

0.97

0.97

0.99

1.03

0.99

Bra

zil2

3 4

64 3

301

841

843

1 83

7 23

66

633

1 83

0 60

30.

3624

90 7

884

893

2 4

02 2

8014

7 84

221

780

666

4 90

72

410

122

0.33

0.69

3.35

0.99

0.99

0.69

0.97

1.36

Can

ada

403

803

396

423

391

788

2 03

538

9 99

00.

5238

1 16

529

687

348

481

1 58

416

197

1 70

919

032

31 2

7136

4 67

84.

444.

945.

220.

950.

940.

860.

960.

98

Cze

ch R

epub

lic

134

627

132

508

132

508

2 18

113

0 32

71.

6512

9 42

25

365

125

639

1329

781

1 61

95

378

125

936

0.24

1.88

1.29

0.98

0.98

0.93

0.97

0.99

Den

mar

k 53

693

52 1

6152

161

345

51 8

160.

6650

236

4 23

547

786

119

1 19

599

1 06

84

354

48 9

812.

443.

082.

180.

970.

970.

890.

980.

97

Finl

and

66 5

7166

561

66 3

1955

065

769

0.83

65 8

754

864

62 8

2658

673

2226

14

922

63 4

991.

061.

880.

410.

980.

980.

940.

961.

00

Fran

ce

788

387

788

387

750

460

17 7

2873

2 73

22.

3674

4 75

44

673

730

494

598

208

416

077

4 73

273

8 70

21.

113.

450.

820.

970.

920.

930.

991.

02

Ger

man

y 92

7 47

392

4 54

992

4 54

95

423

919

126

0.59

935

223

5 07

382

6 81

660

9 16

38

1 05

35

133

835

979

1.10

1.68

0.13

0.98

0.98

0.89

0.89

1.02

Gre

ece

128

175

124

656

124

187

200

123

987

0.16

110

622

3 64

411

1 36

321

682

351

039

3 66

511

2 04

50.

610.

770.

930.

990.

990.

871.

010.

89

Hun

gary

312

0 75

911

5 32

511

5 32

50

115

325

0.00

131

291

4 88

710

7 46

034

765

791

762

4 92

110

8 22

40.

710.

711.

630.

990.

990.

890.

821.

14

Icel

and

4 06

24

044

4 04

418

4 02

60.

454

020

3 37

23

869

7979

1515

3 45

13

949

2.01

2.44

0.38

0.98

0.98

0.95

0.98

1.00

Irel

and

65 3

3964

370

63 5

721

021

62 5

511.

6162

138

3 85

456

209

134

1 73

412

01

501

3 98

857

942

2.99

4.55

2.59

0.95

0.94

0.86

0.93

0.99

Ital

y 58

4 41

757

4 86

457

4 86

477

557

4 08

90.

1356

2 76

34

984

510

792

117

12 2

470

05

101

523

039

2.34

2.47

0.00

0.98

0.98

0.87

0.93

0.98

Japa

n 1

490

000

1 48

5 26

91

459

296

34 1

241

425

172

2.34

1 42

0 53

35

256

1 44

6 59

60

00

05

256

1 44

6 59

60.

002.

340.

000.

980.

960.

971.

021.

00

Kor

ea

712

812

602

605

602

605

1 82

060

0 78

50.

3058

9 01

84

982

579

109

682

681

8 72

24

988

579

936

0.14

0.44

1.50

1.00

1.00

0.81

0.98

0.98

Latv

ia38

000

35 9

8135

981

886

35 0

952.

4635

629

3 92

030

063

6240

29

603

982

30 4

651.

323.

750.

200.

960.

960.

790.

861.

02

Liec

hten

stei

n 41

532

632

60

326

0.00

327

314

325

22

00

316

327

0.61

0.61

0.00

0.99

0.99

0.78

1.00

1.00

Luxe

mbo

urg

4 55

64

556

4 55

641

64

140

9.13

4140

3 52

84

138

00

00

3 52

84

138

0.00

9.13

0.00

0.91

0.91

0.91

1.00

1.00

Mex

ico

2 12

7 50

41

098

605

1 07

3 31

701

073

317

0.00

1 06

3 52

44

600

960

011

256

439

274

897

4 60

296

0 57

50.

060.

067.

801.

000.

980.

450.

900.

99

PIS

A 2

00

0 t

ech

nic

al r

epor

t

136 Tab

le 3

1 (c

ont.)

Sam

plin

g Sh

eet

Info

rmat

ion

Sam

ple

Info

rmat

ion

Part

icip

ants

Excl

uded

Inel

igib

leEl

igib

leC

over

age

Indi

ces

All

Enro

lled

Nat

iona

lSc

hool

-N

atio

nal

%En

rolle

dA

ctua

lW

’tdA

ctua

lW

’tdA

ctua

lW

’tdA

ctua

lW

’tdW

ithin

-O

ver-

Inel

i-15

-15

-T

arge

tLe

vel

Tar

get

Scho

ol-

Stud

ents

Scho

olal

lgi

ble

year

- ye

ar-

Excl

M

inus

Leve

lon

Excl

’nEx

cl’n

(%)

12

34

5ol

dsol

ds’n

sSc

hool

Ex

clSc

hool

Rat

eR

ate

Excl

’ns

’ns

Fram

e(%

)(%

)

SF 2

[a]

SF 2

[b]

SF 3

[a]

SF 3

[b]

SF 3

[c]

3[b]

/3[a

] S

PE

I &

WP+

EE/

(P+E

)

Net

herl

ands

17

8 92

417

8 92

417

8 92

47

800

171

124

4.36

180

697

2 50

315

7 32

71

2334

2 60

62

504

157

350

0.01

4.37

1.66

0.96

0.96

0.88

0.87

1.06

New

Zea

land

54

220

51 4

6451

464

976

50 4

881.

9050

645

3 66

746

757

137

1 59

024

52

712

3 80

448

347

3.29

5.12

5.61

0.95

0.95

0.86

0.95

1.00

Nor

way

52

165

51 5

8751

474

420

51 0

540.

8250

271

4 14

749

579

9394

429

278

4 24

050

523

1.87

2.67

0.55

0.97

0.97

0.95

1.01

0.98

Pola

nd4

665

500

643

528

643

528

56 5

2458

7 00

48.

7854

6 84

23

654

542

005

535

484

263

399

3 70

754

7 48

91.

009.

700.

620.

900.

900.

811.

000.

93

Port

ugal

13

2 32

512

7 16

512

7 16

50

127

165

0.00

126

505

4 58

599

998

122

2 77

726

55

284

4 70

710

2 77

52.

702.

705.

140.

970.

970.

760.

810.

99

Rus

sian

Fed

erat

ion

2 26

8 56

62

259

985

2 25

9 98

510

867

2 24

9 11

80.

482

249

118

6 70

11

968

131

224

960

137

40 1

666

723

1 97

3 09

10.

250.

732.

040.

990.

990.

870.

881.

00

Spai

n 46

2 08

245

1 68

545

1 68

52

180

449

505

0.48

444

288

6 21

439

9 05

515

38

998

180

10 9

736

367

408

053

2.21

2.68

2.69

0.97

0.97

0.86

0.92

0.99

Swed

en

100

940

100

940

100

940

1 36

099

580

1.35

100

578

4 41

694

338

174

3 34

932

671

4 59

097

687

3.43

4.73

0.69

0.95

0.95

0.93

0.97

1.01

Switz

erla

nd

81 3

5079

232

79 2

3295

478

278

1.20

97 1

626

100

72 0

1062

822

150

1 99

86

162

72 8

331.

132.

322.

740.

980.

980.

890.

751.

24

Uni

ted

Kin

gdom

573

1 74

370

5 87

566

9 87

517

674

652

201

2.64

654

095

9 34

064

3 04

121

915

990

261

13 4

139

559

659

030

2.43

5.00

2.04

0.95

0.90

0.88

1.01

1.00

Engl

and6

603

100

580

546

580

546

16 1

2156

4 42

52.

7856

7 20

44

120

560

248

133

15 3

0910

010

973

4 25

357

5 55

82.

665.

361.

910.

950.

950.

931.

011.

00

Nor

ther

n Ir

elan

d 26

043

26 4

7726

477

436

26 0

411.

6526

129

2 84

925

757

8668

158

447

2 93

526

438

2.57

4.18

1.69

0.96

0.96

0.99

1.01

1.00

Scot

land

65

200

62 8

5262

852

1 11

761

735

1.78

60 7

622

371

57 0

350

010

31

993

2 37

157

035

0.00

1.78

3.49

0.98

0.98

0.87

0.94

0.98

Uni

ted

Stat

es

3 87

6 00

03

836

000

3 83

6 00

00

3 83

6 00

00.

003

567

961

3 84

63

121

874

211

132

543

221

148

374

4 05

73

254

417

4.07

4.07

4.56

0.96

0.96

0.81

0.91

0.93

Not

es:

1.T

he s

ampl

ing

form

num

bers

for

Bel

gium

exc

eed

the

sum

of

the

two

part

s be

caus

e th

e G

erm

an-s

peak

ing

com

mun

ity

in B

elgi

um i

s al

so i

nclu

ded

in t

hese

num

bers

.2.

Bra

zil’s

elig

ible

pop

ulat

ion

cons

ists

of

15-y

ear-

olds

in

grad

es 7

to

10.

The

las

t tw

o co

vera

ge i

ndic

es a

re n

ot a

vaila

ble

beca

use

Sis

an

over

esti

mat

e of

the

non

-gra

de 5

and

6 p

opul

atio

n (i

t on

lyex

clud

es s

choo

ls w

ith

only

gra

des

5 an

d 6)

.3.

Hun

gary

use

d gr

ade

9 en

rolm

ents

for

pri

mar

y sc

hool

s, w

hich

was

not

a g

ood

mea

sure

of

size

. T

here

fore

, th

e va

lue

of S

has

been

rec

alcu

late

d as

122

609

.74

(sum

of

scho

ol b

ase

wei

ghts

wei

ghte

d by

MO

Sfo

r th

e se

cond

ary

scho

ols)

plu

s 8

681

(sum

of

stud

ent

wei

ghts

fro

m t

he p

rim

ary

scho

ols

stra

tum

)=13

1 29

0.74

4.P

rim

ary

scho

ols

in P

olan

d w

ere

not

rand

omly

sam

pled

and

the

refo

re t

hese

stu

dent

s co

uld

not

be u

sed.

As

a re

sult

, th

e 6.

7 pe

r ce

nt o

f th

e po

pula

tion

acc

ount

ed f

or b

y th

ese

stud

ents

are

adde

d to

the

sch

ool-

leve

l ex

clus

ion

figu

re.

5.T

he S

ampl

ing

Form

2 n

umbe

rs f

or t

he U

nite

d K

ingd

om e

xcee

d th

e su

m o

f th

e th

ree

part

s be

caus

e W

ales

is

also

inc

lude

d in

the

se n

umbe

rs.

Thi

s al

so a

ffec

ts I

ndic

es 2

and

3 f

or t

he U

nite

dK

ingd

om.

6.O

n Sa

mpl

ing

Form

3,

the

wit

hin-

scho

ol e

xclu

sion

fig

ure

for

Eng

land

was

1 2

58.

Aft

er s

ampl

ing,

spe

cial

edu

cati

on s

choo

ls w

ere

also

exc

lude

d. T

hey

acco

unt

for

anot

her

14 8

63 f

or a

new

tota

l sc

hool

-lev

el e

xclu

sion

of

16 1

21.

7.(a

) St

uden

ts i

n vo

cati

onal

sch

ools

are

enr

olle

d on

a p

art-

tim

e/pa

rt-y

ear

basi

s. (

b) S

ince

the

PIS

A a

sses

smen

t w

as c

ondu

cted

onl

y at

one

poi

nt i

n ti

me,

the

se s

tude

nts

are

capt

ured

onl

y pa

rtia

lly.

(c)

It i

s no

t po

ssib

le t

o as

sess

how

wel

l th

e st

uden

ts s

ampl

ed f

rom

voc

atio

nal

scho

ols

repr

esen

t th

e un

iver

se o

f st

uden

ts e

nrol

led

in v

ocat

iona

l sc

hool

s, a

nd s

o th

ose

stud

ents

not

att

endi

ngcl

asse

s at

the

tim

e of

the

PIS

A a

sses

smen

t ar

e no

t re

pres

ente

d in

the

PIS

A r

esul

ts.

For calculating school response rates afterreplacement, the numerator consisted of all sampleschools (original plus replacement) with enrolledage-eligible students that participated (i.e.,assessed a sample of eligible students and obtaineda student response rate of at least 50 per cent).The denominator consisted of all the schools inthe numerator that were in the original sample,plus those original sample schools that had age-eligible students enrolled, but that failed toassess at least 50 per cent of eligible samplestudents and for which no replacement schoolparticipated. Schools that were included in thesampling frame, but were found to contain noage-eligible students, were omitted from thecalculation of response rates. Replacementschools were included only when theyparticipated, and were replacing a refusingschool that had age-eligible students.

In calculating weighted school response rates,each school received a weight equal to theproduct of its base weight (the reciprocal of itsselection probability) and the number of age-eligible students enrolled, as indicated on thesampling frame.

With the use of probability proportional-to-size sampling, in countries with few certaintyschool selections and no over-sampling or under-sampling of any explicit strata, weighted andunweighted rates are very similar.

Thus, the weighted school response ratebefore replacement is given by the formula:

, (79)

where Y denotes the set of responding originalsample schools with age-eligible students, Ndenotes the set of eligible non-respondingoriginal sample schools, Wi denotes the baseweight for school i, Wi = 1/Pi, where Pi denotesthe school selection probability for school i, andEi denotes the enrolment size of age-eligiblestudents, as indicated on the sampling frame.

The weighted school response rate, afterreplacement, is given by the formula:

, (80)

where Y denotes the set of responding originalsample schools, R denotes the set of respondingreplacement schools, for which thecorresponding original sample school waseligible but was non-responding, N denotes theset of eligible refusing original sample schools,Wi denotes the base weight for school i, Wi = 1/Pi, where Pi denotes the school selectionprobability for school i, and for weighted rates,Ei denotes the enrolment size of age-eligiblestudents, as indicated on the sampling frame.

For unweighted student response rates, thenumerator is the number of students for whomassessment data were included in the results. Thedenominator is the number of sampled studentswho were age-eligible, and not explicitlyexcluded as student exclusions. The exception iscases where countries applied different samplingrates across explicit strata. In these cases,unweighted rates were calculated in eachstratum, and then weighted together accordingto the relative population size of 15-year-olds ineach stratum.

For weighted student response rates, the samenumber of students appear in the numerator anddenominator as for unweighted rates, but eachstudent was weighted by its student base weight.This is given as the product of the school baseweight—for the school in which the student isenrolled—and the reciprocal of the studentselection probability within the school.

In countries with no over-sampling of anyexplicit strata, weighted and unweighted studentparticipation rates are very similar.

Overall response rates are calculated as theproduct of school and student response rates.Although overall weighted and unweighted ratescan be calculated, there is little value in presentingoverall unweighted rates. The weighted ratesindicate the proportion of the student populationrepresented by the sample prior to making theschool and student non-response adjustments.

=

WiEii∈ (Y ∪ R )∑

WiEii∈ (Y ∪ N )∑

weighted school responserate after replacement

weighted school responserate before replacement

=WiEi

i∈ Y∑

WiEii∈ (Y ∪ N )∑

Sam

pli

ng

outc

omes

137

PIS

A 2

00

0 t

ech

nic

al r

epor

t

138

Table 32: School Response Rates Before Replacements

Country Weighted School Number of Number of Number of Number ofParticipation Responding Schools Responding RespondingRate Before Schools Sampled Schools and Non-

Replacement (%) (Weighted by (Responding and (Unweighted) RespondingEnrolment) Non-Responding) Schools

(Weighted by (Unweighted)Enrolment)

Australia 80.95 197 639 244 157 206 247Austria 99.38 86 062 86 601 212 213Belgium 69.12 81 453 117 836 171 252

Belgium (Fl.) 61.51 42 506 69 110 91 150Belgium (Fr.) 79.93 38 947 48 726 80 102

Brazil 97.38 2 425 608 2 490 788 315 324Canada 87.91 335 100 381 165 1 069 1 159Czech Republic 95.30 123 345 129 422 217 229Denmark 83.66 42 027 50 236 194 235Finland 96.82 63 783 65 875 150 155France 94.66 704 971 744 754 173 184Germany 94.71 885 792 935 222 213 220Greece 83.91 92 824 110 622 98 141Hungary 98.67 209 153 211 969 193 195Iceland 99.88 4 015 4 020 130 131Ireland 85.56 53 164 62 138 132 154Italy 97.90 550 932 562 763 167 170Japan 82.05 1 165 576 1 420 533 123 150Korea 100.00 589 018 589 018 146 146Latvia 82.39 29 354 35 628 143 174Liechtenstein 100.00 327 327 11 11Luxembourg 93.04 3 852 4 140 23 25Mexico 92.69 985 745 1 063 524 170 182Netherlands 27.13 49 019 180 697 49 180New Zealand 77.65 39 328 50 645 136 178Norway 85.95 43 207 50 271 164 191Poland 79.11 432 603 546 842 119 150Portugal 95.27 120 521 126 505 145 152Russian Federation 98.84 4 445 841 4 498 235 237 242Spain 95.41 423 900 444 288 177 185Sweden 99.96 100 534 100 578 159 160Switzerland 91.81 89 208 97 162 271 311United Kingdom 61.27 400 737 654 095 292 430

England 58.83 333 674 567 204 106 180Northern Ireland 70.90 18 525 26 129 100 142Scotland 79.88 48 537 60 762 86 108

United States 56.42 2 013 101 3 567 961 116 210

Table 33: School Response Rates After Replacements

Country Weighted School Number of Number of Number of Number ofParticipation Responding Schools Responding RespondingRate After Schools Sampled Schools and Non-

Replacement (%) (Weighted by (Responding and (Unweighted) RespondingEnrolment) Non-Responding) Schools

(Weighted by (Unweighted)Enrolment)

Australia 93.65 228 668 244 175 228 247Austria 100.00 86 601 86 601 213 213Belgium 85.52 100 833 117 911 214 252

Belgium (Fl.) 79.75 55 144 69 142 119 150Belgium (Fr.) 93.68 45 689 48 770 95 102

Brazil 97.96 2 439 152 2 489 942 318 324Canada 93.31 355 644 381 161 1 098 1 159Czech Republic 99.01 128 551 129 841 227 229Denmark 94.86 47 689 50 271 223 235Finland 100.00 65 875 65 875 155 155France 95.23 709 454 744 982 174 184Germany 94.71 885 792 935 222 213 220Greece 99.77 130 555 130 851 139 141Hungary 98.67 209 153 211 969 193 195Iceland 99.88 4 015 4 020 130 131Ireland 87.53 54 388 62 138 135 154Italy 100.00 562 755 562 755 170 170Japan 90.05 1 279 121 1 420 533 135 150Korea 100.00 589 018 589 018 146 146Latvia 88.51 31 560 35 656 153 174Liechtenstein 100.00 327 327 11 11Luxembourg 93.04 3 852 4 140 23 25Mexico 100.00 1 063 524 1 063 524 182 182Netherlands 55.50 100 283 180 697 100 180New Zealand 86.37 43 744 50 645 152 178Norway 92.25 46 376 50 271 176 191Poland 83.21 455 870 547 847 126 150Portugal 95.27 120 521 126 505 145 152Russian Federation 99.29 4 466 335 4 498 235 238 242Spain 100.00 444 288 444 288 185 185Sweden 99.96 100 534 100 578 159 160Switzerland 95.84 92 888 96 924 282 311United Kingdom 82.14 537 219 654 022 349 430

England 82.32 466 896 567 204 148 180Northern Ireland 79.37 20 755 26 150 113 142Scotland 81.70 49 568 60 668 88 108

United States 70.33 2 503 666 3 559 661 145 210

Sam

pli

ng

outc

omes

139

PIS

A 2

00

0 t

ech

nic

al r

epor

t

140

Table 34: Student Response Rates After Replacements

Country Weighted Student Number Number Number of Number ofParticipation of Students of Students Students Assessed Students

Rate After Second Assessed Sampled (Unweighted) SampledReplacement (%) (Weighted) (Assessed and (Assessed and

Absent) Absent)(Weighted) (Unweighted)

Australia 84.24 161 607 191 850 5 154 6 173

Austria 91.64 65 562 71 547 4 745 5 164

Belgium 93.30 88 816 95 189 6 648 7 103

Belgium (Fl.) 95.49 46 801 49 012 3 874 4 055

Belgium (Fr.) 90.98 42 014 46 177 2 774 3 048

Brazil 87.15 1 463 000 1 678 789 4 885 5 613

Canada 84.89 276 233 325 386 29 461 33 736

Czech Republic 92.76 115 371 124 372 5 343 5 769

Denmark 91.64 37 171 40 564 4 212 4 592

Finland 92.80 58 303 62 826 4 864 5 237

France 91.19 634 276 695 523 4 657 5 115

Germany 85.65 666 794 778 516 4 983 5 788

Greece 96.83 136 919 141 404 4 672 4 819

Hungary 95.31 100 807 105 769 4 883 5 111

Iceland 87.09 3 372 3 872 3 372 3 872

Ireland 85.59 42 088 49 172 3 786 4 424

Italy 93.08 475 446 510 792 4 984 5 369

Japan 96.34 1 267 367 1 315 462 5 256 5 450

Korea 98.84 572 767 579 470 4 982 5 045

Latvia 90.73 24 403 26 895 3 915 4 305

Liechtenstein 96.62 314 325 314 325

Luxembourg 89.19 3 434 3 850 3 434 3 850

Mexico 93.95 903 100 961 283 4 600 4 882

Netherlands 84.03 72 656 86 462 2 503 2 958

New Zealand 88.23 35 616 40 369 3 667 4 163

Norway 89.28 40 908 45 821 4 147 4 665

Poland 87.70 393 675 448 904 3 639 4 169

Portugal 86.28 82 395 95 493 4 517 5 232

Russian Federation 96.21 1 903 348 1 978 266 6 701 6 981

Spain 91.78 366 301 399 100 6 214 6 764

Sweden 87.96 82 956 94 312 4 416 5 017

Switzerland 95.13 65 677 69 037 6 084 6 389

United Kingdom 80.97 419 713 518 358 9 250 11 300

England 81.07 366 277 451 828 4 099 5 047

Northern Ireland 86.01 17 049 19 821 2 825 3 281

Scotland 77.90 36 387 46 708 2 326 2 972

United States 84.99 1 801 229 2 119 392 3 700 4 320

Design Effect andEffective Sample Size

Surveys in education and especially internationalsurveys rarely sample students by simplyselecting a random sample of students (a simplerandom sample). Schools are first selected and,within each selected school, classes or studentsare randomly sampled. Sometimes, geographicareas are first selected before sampling schoolsand students. This sampling design is usuallyreferred to as a cluster sample or a multi-stagesample.

Selected students attending the same schoolcannot be considered as independentobservations as they can be with a simplerandom sample because they are usually moresimilar than students attending distincteducational institutions. For instance, they areoffered the same school resources, may have thesame teachers and therefore are taught acommon implemented curriculum, and so on.School differences are also larger if differenteducational programs are not available in allschools. One expects to observe greaterdifferences between a vocational school and anacademic school than between twocomprehensive schools.

Furthermore, it is well known that within acountry, within sub-national entities and withina city, people tend to live in areas according totheir financial resources. As children usuallyattend schools close to their house, it is likelythat students attending the same school comefrom similar social and economic backgrounds.

A simple random sample of 4 000 students isthus likely to cover the diversity of thepopulation better than a sample of 100 schoolswith 40 students observed within each school. Itfollows that the uncertainty associated with anypopulation parameter estimate (i.e., standarderror) will be larger for a clustered sample thanfor a simple random sample of the same size.

To limit this reduction of precision in thepopulation parameter estimate, multi-stagesample designs usually use complementaryinformation to improve coverage of thepopulation diversity. In PISA, and in previousinternational surveys, the following techniqueswere implemented to limit the increase instandard errors: (i) explicit and/or implicitstratification of the school sample frame and

(ii) selection of the schools with probabilitiesproportional to their size. Complementaryinformation generally cannot fully compensatefor the increase in the standard errors due to themulti-stage design, however.

The reduction in design efficiency is usuallyreported through the ‘design effect’ (Kish, 1965)that describes the ‘ratio of the variance of theestimate obtained from the (more complex)sample to the variance of the estimate that wouldhave been obtained from a simple random sampleof the same number of units. The design effect hastwo primary uses in sample size estimation andin appraising the efficiency of more complexplans.’ (Cochran, 1977).

In PISA, a design effect has been computedfor a statistic t using:

, (81)

where is the sampling variance for thestatistic t computed by the BRR replicationmethod (see Chapter 8), and is thesampling variance for the same statistic t on thesame data base but considering the sample as asimple random sample.

Another way to express the lack of precisiondue to the complex sampling design is throughthe effective sample size, which expresses thesimple random sample size that would give thesame standard error as the one obtained fromthe actual complex sample design. In PISA, theeffective sample size for statistic t is equal to:

, (82)

where n is equal to the actual number of units inthe sample.

The notion of design effect as given in (81) isextended and produces five design effects todescribe the influence of the sampling and testdesigns on the standard errors for statistics.

The total errors computed for the internationalPISA initial report consist of two components:sampling variance and measurement variance. Thestandard error in PISA is inflated because thestudents were not sampled according to a simplerandom sample and also because the measure ofthe student proficiency estimates includes someamount of random error.

Effn tn

Deff t

n Var t

Var tSRS

BRR3

3( )

( )

( )

( )= =

×

Var tSRS ( )

Var tBRR ( )

Deff tVar t

Var tBRR

SRS3 ( )

( )

( )=

Sam

pli

ng

outc

omes

141

For any statistic t, the population estimateand the sampling variance are computed foreach plausible value and then combined asdescribed in Chapter 9.

The five design effects and their respectiveeffective sample sizes that are considered aredefined as follows:

, (83)

where is the measurement variance forthe statistic t. This design effect shows the inflationof the total variance that would have occurreddue to measurement error if in fact the samplewere considered a simple random sample.

(84)

shows the inflation of the total variance due onlyto the use of the complex sampling design.

(85)

shows the inflation of the sampling variance dueto the use of the complex design.

(86)

shows the inflation of the total variance due tothe measurement error.

(87)

shows the inflation of the total variance due tothe measurement error and due to the complexsampling design.

The product of the first and second designeffects is equal to the product of the third andfourth design effects, and both products areequal to the fifth design effect.

Tables 35 to 39 provide the design effects andthe effective sample sizes, respectively, for themeans of the combined reading, mathematicaland scientific literacy scales, the percentage ofstudents at level 3 on the combined readingliteracy scale, and the percentage of students atlevel 5 on the combined reading literacy scale.

The results show that the design effectsdepend on the computed statistics. It is wellknown that the design effects due to multi-stagesample designs are usually high for statisticssuch as means but much lower for relationalstatistics such as correlation or regressioncoefficients.

Because the samples for the mathematics andscience scales are drawn from the same schoolsas that for the combined reading scale, but withmuch fewer students, it follows that the readingsample is much more clustered than for thescience and mathematics samples. Therefore it isnot surprising to find that design effects aregenerally substantially higher for reading thanfor mathematics and science.

The design effect due to the multi-stagesample is generally quite small for the percentageof students at level 3 but generally higher for thepercentage of students at level 5. Recoding thestudent proficiency estimates into being or notbeing at level 3 suppresses the distinctionbetween high and low achievers and thereforeconsiderably reduces the school variance.Recoding student proficiency estimates intobeing or not being at level 5, however, keeps thedistinction between high and lower achieversand preserves some of the school variance.

The measurement error for the minordomains is not substantially higher than themeasurement error for the major domainbecause the proficiency estimates were generatedwith a multi-dimensional model using a large setof variables as conditioning variables. Thiscomplementary information has effectivelyreduced the measurement error for the minordomain proficiency estimates.Deff t

Var t MVar t

Var tBRR

SRS5 ( )

( ) ( )

( )=

+

Deff tVar t MVar t

Var tBRR

BRR4 ( )

( ) ( )

( )=

+

Deff tVar t

Var tBRR

SRS3 ( )

( )

( )=

Deff tVar t MVar t

Var t MVar tBRR

SRS2 ( )

( ) ( )

( ) ( )=

++

MVar t( )

Deff tVar t MVar t

Var tSRS

SRS1( )

( ) ( )

( )=

+

PIS

A 2

00

0 t

ech

nic

al r

epor

t

142

Sam

pli

ng

outc

omes

143

Tab

le 3

5:

Des

ign

Eff

ects

and

Eff

ecti

ve S

ampl

e Si

zes

for

the

Mea

n Pe

rfor

man

ce o

n th

e C

ombi

ned

Rea

ding

Lit

erac

y Sc

ale

Des

ign

Des

ign

Des

ign

Des

ign

Des

ign

Eff

ecti

ve

Eff

ecti

veE

ffec

tive

Eff

ecti

ve

Eff

ecti

veE

ffec

t 1

Eff

ect

2E

ffec

t 3

Eff

ect

4E

ffec

t 5

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Size

1Si

ze 2

Size

3Si

ze 4

Size

5

Aus

tral

ia1.

304.

775.

901.

056.

203

983

1 08

587

74

926

835

Aus

tria

1.06

2.98

3.10

1.02

3.16

4 48

31

590

1 53

14

657

1 50

2B

elgi

um1.

066.

967.

311.

017.

376

302

958

912

6 61

790

5B

razi

l1.

195.

326.

141.

036.

334

112

920

797

4 74

677

3C

anad

a1.

097.

417.

971.

018.

0627

294

4 00

93

726

29 3

643

686

Cze

ch R

epub

lic1.

073.

043.

181.

023.

255

019

1 76

61

688

5 25

11

652

Den

mar

k1.

082.

262.

361.

032.

443

924

1 87

51

796

4 09

71

737

Finl

and

1.14

3.55

3.90

1.04

4.04

4 27

01

370

1 24

64

697

1 20

3Fr

ance

1.12

3.70

4.02

1.03

4.13

4 18

91

262

1 16

44

542

1 13

1G

erm

any

1.13

2.20

2.36

1.06

2.49

4 47

32

309

2 15

24

800

2 03

6G

reec

e1.

1910

.29

12.0

41.

0212

.23

3 93

045

438

84

600

382

Hun

gary

1.03

8.41

8.64

1.00

8.67

4 74

358

156

64

870

564

Icel

and

1.11

0.75

0.73

1.15

0.84

3 04

54

470

4 63

32

936

4 03

7Ir

elan

d1.

114.

164.

501.

024.

613

474

927

856

3 76

283

6It

aly

1.16

4.35

4.90

1.03

5.06

4 28

01

147

1 01

84

822

985

Japa

n1.

1117

.53

19.2

81.

0119

.38

4 75

330

027

35

227

271

Kor

ea1.

135.

335.

891.

026.

024

413

935

846

4 87

582

8L

atvi

a1.

208.

6210

.16

1.02

10.3

63

240

451

383

3 81

737

6L

iech

tens

tein

1.10

0.52

0.48

1.20

0.57

286

600

658

261

547

Lux

embo

urg

1.16

0.77

0.73

1.22

0.89

3 04

34

603

4 83

82

893

3 97

0M

exic

o1.

175.

886.

691.

026.

853

945

783

688

4 48

967

1N

ethe

rlan

ds1.

063.

393.

521.

023.

582

369

739

711

2 46

369

9N

ew Z

eala

nd1.

032.

352.

401.

012.

433

549

1 56

01

531

3 61

71

510

Nor

way

1.06

2.85

2.97

1.02

3.03

3 89

51

457

1 39

84

058

1 36

9Po

land

1.16

6.29

7.12

1.02

7.28

3 15

858

151

33

575

502

Port

ugal

1.20

8.30

9.72

1.02

9.91

3 83

655

347

24

495

462

Rus

sian

Fed

erat

ion

1.16

11.7

913

.53

1.01

13.6

95

771

568

495

6 62

249

0Sp

ain

1.17

5.44

6.18

1.03

6.35

5 32

31

143

1 00

56

050

979

Swed

en1.

202.

102.

321.

092.

523

669

2 10

61

903

4 05

91

749

Swit

zerl

and

1.05

10.0

410

.52

1.00

10.5

75

798

607

580

6 07

057

7U

nite

d K

ingd

om1.

095.

555.

971.

026.

078

552

1 68

21

564

9 19

81

540

Uni

ted

Stat

es1.

1015

.82

17.2

91.

0117

.39

3 50

024

322

23

824

221

PIS

A 2

00

0 t

ech

nic

al r

epor

t

144

Tab

le 3

6:

Des

ign

Eff

ects

and

Eff

ecti

ve S

ampl

e Si

zes

for

the

Mea

n Pe

rfor

man

ce o

n th

e M

athe

mat

ical

Lit

erac

y Sc

ale

Des

ign

Des

ign

Des

ign

Des

ign

Des

ign

Eff

ecti

ve

Eff

ecti

veE

ffec

tive

Eff

ecti

ve

Eff

ecti

veE

ffec

t 1

Eff

ect

2E

ffec

t 3

Eff

ect

4E

ffec

t 5

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Size

1Si

ze 2

Size

3Si

ze 4

Size

5

Aus

tral

ia1.

492.

893.

811.

134.

291

923

991

751

2 53

466

6A

ustr

ia1.

011.

931.

931.

001.

942

620

1 37

01

365

2 63

01

360

Bel

gium

1.12

4.54

4.98

1.03

5.10

3 36

683

476

13

692

742

Bra

zil

1.25

3.14

3.68

1.07

3.93

2 17

586

473

92

544

692

Can

ada

1.12

4.05

4.42

1.03

4.55

14 6

824

072

3 72

616

041

3 62

6C

zech

Rep

ublic

1.03

2.46

2.51

1.01

2.55

2 96

41

246

1 22

13

025

1 20

4D

enm

ark

1.23

1.53

1.65

1.14

1.88

1 93

61

556

1 44

02

090

1 26

4Fi

nlan

d1.

251.

541.

681.

151.

932

163

1 75

11

610

2 35

21

402

Fran

ce1.

211.

992.

191.

092.

402

153

1 30

51

184

2 37

31

082

Ger

man

y1.

061.

621.

651.

031.

712

682

1 74

71

711

2 73

81

656

Gre

ece

1.24

5.60

6.68

1.04

6.91

2 10

846

639

02

516

377

Hun

gary

1.04

4.53

4.66

1.01

4.69

2 70

161

860

12

777

597

Icel

and

1.25

1.06

1.08

1.23

1.33

1 50

51

768

1 74

11

527

1 41

4Ir

elan

d1.

072.

092.

171.

032.

251

984

1 01

697

92

059

948

Ital

y1.

322.

212.

591.

122.

912

101

1 25

01

066

2 46

495

0Ja

pan

1.10

10.6

011

.57

1.01

11.6

72

655

276

253

2 89

925

0K

orea

1.12

2.65

2.84

1.04

2.97

2 47

01

047

974

2 65

693

4L

atvi

a1.

183.

403.

831.

054.

001

826

632

562

2 05

453

7L

iech

tens

tein

1.15

0.81

0.78

1.19

0.93

153

216

224

147

189

Lux

embo

urg

1.11

0.81

0.79

1.14

0.90

1 76

12

415

2 48

01

713

2 17

0M

exic

o1.

183.

604.

061.

044.

232

181

714

633

2 46

060

6N

ethe

rlan

ds1.

082.

172.

271.

042.

351

280

636

610

1 33

458

9N

ew Z

eala

nd1.

141.

821.

931.

072.

071

793

1 12

81

060

1 90

898

8N

orw

ay1.

241.

701.

871.

132.

111

857

1 35

71

234

2 04

21

093

Pola

nd1.

085.

205.

561.

025.

641

823

380

356

1 94

735

0Po

rtug

al1.

104.

634.

981.

025.

072

323

550

511

2 49

750

2R

ussi

an F

eder

atio

n1.

158.

9010

.09

1.02

10.2

43

232

418

369

3 66

436

3Sp

ain

1.03

3.96

4.04

1.01

4.07

3 33

086

684

83

403

841

Swed

en1.

121.

531.

591.

071.

712

207

1 60

91

546

2 29

51

441

Swit

zerl

and

1.20

5.49

6.37

1.03

6.57

2 84

161

853

33

295

517

Uni

ted

Kin

gdom

1.

173.

313.

701.

053.

864

450

1 57

01

406

4 96

81

345

Uni

ted

Stat

es

1.10

11.7

712

.79

1.01

12.8

91

950

181

167

2 11

916

6

Tab

le 3

7:

Des

ign

Eff

ects

and

Eff

ecti

ve S

ampl

e Si

zes

for

the

Mea

n Pe

rfor

man

ce o

n th

e Sc

ient

ific

Lit

erac

y Sc

ale

Des

ign

Des

ign

Des

ign

Des

ign

Des

ign

Eff

ecti

ve

Eff

ecti

veE

ffec

tive

Eff

ecti

ve

Eff

ecti

veE

ffec

t 1

Eff

ect

2E

ffec

t 3

Eff

ect

4E

ffec

t 5

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Size

1Si

ze 2

Size

3Si

ze 4

Size

5

Aus

tral

ia1.

203.

223.

671.

063.

882

374

889

779

2 70

973

8A

ustr

ia1.

071.

952.

011.

032.

082

500

1 37

01

327

2 58

21

284

Bel

gium

1.03

5.39

5.53

1.01

5.56

3 61

369

067

43

702

670

Bra

zil

1.63

2.16

2.90

1.22

3.53

1 66

01

253

935

2 22

076

8C

anad

a1.

104.

705.

061.

025.

1515

047

3 50

63

260

16 1

813

199

Cze

ch R

epub

lic1.

081.

901.

971.

042.

052

841

1 61

11

554

2 94

61

495

Den

mar

k1.

041.

671.

701.

021.

742

256

1 40

51

383

2 29

21

351

Finl

and

1.24

1.80

1.99

1.12

2.23

2 18

01

510

1 36

32

414

1 21

4Fr

ance

1.25

2.01

2.26

1.11

2.50

2 08

01

290

1 14

82

337

1 03

6G

erm

any

1.22

1.33

1.41

1.16

1.63

2 34

12

142

2 03

12

466

1 75

7G

reec

e1.

026.

516.

601.

006.

612

553

398

393

2 58

739

2H

unga

ry1.

054.

424.

581.

014.

622

678

633

612

2 77

260

6Ic

elan

d1.

031.

101.

111.

031.

141

804

1 68

41

679

1 80

91

634

Irel

and

1.02

2.52

2.55

1.01

2.56

2 09

784

783

82

119

833

Ital

y1.

052.

542.

621.

022.

682

629

1 08

71

054

2 71

21

033

Japa

n1.

179.

1210

.50

1.02

10.6

72

489

320

277

2 86

727

3K

orea

1.22

2.52

2.85

1.08

3.07

2 26

41

095

968

2 56

189

9L

atvi

a1.

056.

807.

081.

017.

132

059

317

305

2 14

230

3L

iech

tens

tein

1.04

0.95

0.95

1.04

0.99

170

185

185

169

178

Lux

embo

urg

1.15

0.98

0.98

1.15

1.13

1 69

81

983

1 98

81

691

1 72

7M

exic

o1.

193.

664.

151.

044.

342

149

696

613

2 43

958

7N

ethe

rlan

ds1.

022.

322.

351.

012.

381

364

601

593

1 38

258

7N

ew Z

eala

nd1.

031.

121.

121.

021.

151

974

1 81

11

805

1 98

01

762

Nor

way

1.06

1.81

1.85

1.03

1.91

2 18

11

279

1 24

62

237

1 20

8Po

land

1.43

3.99

5.28

1.08

5.72

1 42

551

338

71

888

357

Port

ugal

1.03

4.98

5.11

1.01

5.14

2 47

151

349

92

536

496

Rus

sian

Fed

erat

ion

1.14

7.42

8.34

1.02

8.48

3 25

250

144

63

656

438

Spai

n1.

043.

193.

271.

013.

313

339

1 08

31

057

3 42

01

046

Swed

en1.

131.

571.

641.

081.

772

163

1 55

81

488

2 26

51

379

Swit

zerl

and

1.29

5.18

6.40

1.05

6.70

2 62

665

653

13

248

507

Uni

ted

Kin

gdom

1.26

3.07

3.61

1.07

3.88

4 09

91

687

1 43

34

826

1 33

6U

nite

d St

ates

1.12

9.91

11.0

11.

0111

.13

1 89

421

519

32

105

191

Sam

pli

ng

outc

omes

145

Tab

le 3

8:

Des

ign

Eff

ects

and

Eff

ecti

ve S

ampl

e Si

zes

for

the

Perc

enta

ge o

f St

uden

ts a

t L

evel

3 o

n th

e C

ombi

ned

Rea

ding

Lit

erac

y Sc

ale

Des

ign

Des

ign

Des

ign

Des

ign

Des

ign

Eff

ecti

ve

Eff

ecti

veE

ffec

tive

Eff

ecti

ve

Eff

ecti

veE

ffec

t 1

Eff

ect

2E

ffec

t 3

Eff

ect

4E

ffec

t 5

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Size

1Si

ze 2

Size

3Si

ze 4

Size

5

Aus

tral

ia1.

971.

602.

191.

473.

162

626

3 22

82

366

3 52

21

639

Aus

tria

2.22

1.41

1.92

1.64

3.13

2 14

13

356

2 47

62

897

1 51

5B

elgi

um1.

571.

702.

091.

272.

654

259

3 93

53

193

5 23

72

512

Bra

zil

2.27

2.10

3.50

1.37

4.77

2 15

62

327

1 39

83

584

1 02

6C

anad

a1.

771.

832.

461.

323.

2316

796

16 2

4912

061

22 5

169

194

Cze

ch R

epub

lic2.

911.

121.

362.

413.

271

847

4 77

03

938

2 22

21

642

Den

mar

k1.

781.

141.

251.

632.

032

383

3 71

53

392

2 59

82

091

Finl

and

1.42

1.08

1.11

1.38

1.53

3 42

64

523

4 39

43

518

3 18

6Fr

ance

1.15

1.81

1.93

1.08

2.07

4 07

62

584

2 42

54

340

2 25

4G

erm

any

1.79

1.42

1.75

1.47

2.54

2 83

13

583

2 90

63

462

1 99

9G

reec

e3.

371.

984.

311.

556.

671

388

2 35

51

084

3 00

770

0H

unga

ry1.

572.

623.

541.

164.

113

110

1 86

81

380

4 20

61

188

Icel

and

1.09

1.14

1.15

1.08

1.24

3 10

12

961

2 92

93

132

2 72

3Ir

elan

d1.

941.

181.

341.

732.

281

983

3 28

02

877

2 23

11

688

Ital

y1.

311.

782.

021.

162.

343

795

2 80

22

464

4 29

02

134

Japa

n1.

472.

663.

431.

143.

903

582

1 97

71

530

4 62

61

347

Kor

ea1.

122.

102.

231.

052.

354

452

2 37

62

236

4 72

82

123

Lat

via

2.37

1.47

2.12

1.68

3.49

1 64

42

644

1 83

62

318

1 11

6L

iech

tens

tein

1.80

0.96

0.93

1.89

1.73

174

327

339

167

181

Lux

embo

urg

2.12

1.02

1.04

2.10

2.16

1 66

33

463

3 39

41

677

1 63

2M

exic

o1.

602.

523.

421.

184.

022

877

1 82

81

343

3 90

11

143

Net

herl

ands

1.80

2.10

2.98

1.28

3.79

1 38

71

192

839

1 95

766

1N

ew Z

eala

nd2.

221.

081.

182.

042.

401

650

3 39

03

105

1 79

31

525

Nor

way

1.37

1.04

1.06

1.36

1.43

3 02

83

976

3 91

73

056

2 90

4Po

land

1.44

2.24

2.78

1.16

3.22

2 54

61

630

1 31

43

151

1 13

6Po

rtug

al1.

512.

022.

541.

203.

043

043

2 26

91

807

3 81

71

506

Rus

sian

Fed

erat

ion

1.76

2.15

3.02

1.25

3.78

3 81

83

113

2 21

75

353

1 77

4Sp

ain

1.71

1.50

1.85

1.39

2.57

3 62

44

146

3 35

04

456

2 41

8Sw

eden

2.09

1.03

1.07

2.05

2.15

2 11

44

281

4 14

62

150

2 05

0Sw

itze

rlan

d1.

412.

302.

831.

153.

244

337

2 65

12

155

5 32

31

884

Uni

ted

Kin

gdom

2.39

1.58

2.39

1.60

3.78

3 90

95

904

3 90

75

854

2 47

1U

nite

d St

ates

1.97

1.60

2.18

1.46

3.15

1 95

72

401

1 76

32

638

1 22

2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

146

Sam

pli

ng

outc

omes

147

Tab

le 3

9:

Des

ign

Eff

ects

and

Eff

ecti

ve S

ampl

e Si

zes

for

the

Perc

enta

ge o

f St

uden

ts a

t L

evel

5 o

n th

e C

ombi

ned

Rea

ding

Lit

erac

y Sc

ale

Des

ign

Des

ign

Des

ign

Des

ign

Des

ign

Eff

ecti

ve

Eff

ecti

veE

ffec

tive

Eff

ecti

ve

Eff

ecti

veE

ffec

t 1

Eff

ect

2E

ffec

t 3

Eff

ect

4E

ffec

t 5

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Sam

ple

Size

1Si

ze 2

Size

3Si

ze 4

Size

5

Aus

tral

ia1.

334.

055.

041.

075.

373

900

1 27

91

026

4 85

496

4A

ustr

ia1.

951.

732.

421.

413.

372

433

2 74

41

960

3 36

61

407

Bel

gium

1.61

1.69

2.11

1.29

2.71

4 14

73

953

3 16

85

151

2 45

8B

razi

l4.

081.

231.

872.

834.

951

199

3 96

72

623

1 72

798

9C

anad

a1.

723.

124.

631.

165.

3517

308

9 51

86

407

25 7

015

550

Cze

ch R

epub

lic1.

281.

962.

221.

132.

494

206

2 74

32

418

4 76

52

151

Den

mar

k1.

361.

121.

161.

321.

523

112

3 78

73

649

3 21

42

784

Finl

and

1.86

1.44

1.81

1.47

2.67

2 62

13

382

2 68

23

299

1 82

2Fr

ance

1.50

1.21

1.31

1.38

1.80

3 12

63

876

3 57

53

376

2 59

3G

erm

any

1.32

1.25

1.33

1.25

1.65

3 83

34

062

3 81

64

072

3 06

9G

reec

e1.

203.

504.

001.

054.

203

892

1 33

41

167

4 44

51

112

Hun

gary

1.49

3.91

5.33

1.09

5.81

3 29

01

249

918

4 47

284

1Ic

elan

d2.

091.

001.

002.

102.

101

611

3 36

83

365

1 60

21

609

Irel

and

1.25

1.75

1.94

1.13

2.19

3 08

62

200

1 98

93

412

1 76

2It

aly

1.72

1.45

1.78

1.41

2.51

2 89

03

433

2 79

83

536

1 98

9Ja

pan

1.41

5.14

6.85

1.06

7.27

3 71

81

023

767

4 95

772

3K

orea

1.83

2.01

2.84

1.29

3.67

2 72

22

482

1 75

23

854

1 35

6L

atvi

a1.

243.

003.

471.

073.

713

138

1 30

01

121

3 63

01

048

Lie

chte

nste

in1.

850.

890.

812.

061.

6617

035

139

015

218

9L

uxem

bour

g1.

711.

121.

201.

601.

912

062

3 16

12

945

2 20

41

848

Mex

ico

1.66

1.81

2.34

1.29

3.00

2 76

62

539

1 96

53

566

1 53

1N

ethe

rlan

ds1.

341.

491.

651.

211.

991

875

1 67

91

513

2 07

51

258

New

Zea

land

1.41

1.64

1.90

1.22

2.31

2 60

62

237

1 93

03

018

1 58

9N

orw

ay1.

681.

231.

381.

502.

072

465

3 37

52

997

2 77

02

007

Pola

nd2.

122.

824.

851.

235.

961

727

1 29

775

42

963

613

Port

ugal

1.51

2.20

2.80

1.18

3.31

3 03

82

088

1 63

63

876

1 38

5R

ussi

an F

eder

atio

n1.

612.

954.

141.

154.

754

164

2 27

01

620

5 83

31

412

Spai

n2.

921.

261.

772.

113.

682

129

4 92

83

519

2 94

81

686

Swed

en1.

551.

401.

611.

342.

162

853

3 16

32

737

3 28

62

043

Swit

zerl

and

1.38

5.75

7.57

1.05

7.95

4 40

91

061

806

5 80

676

7U

nite

d K

ingd

om1.

524.

205.

861.

096.

386

147

2 22

31

593

8 57

21

463

Uni

ted

Stat

es1.

604.

115.

981.

106.

582

401

935

643

3 48

858

4

Scaling Outcomes

Ray Adams and Claus Carstensen

InternationalCharacteristics of theItem Pool

When main study data were received from eachparticipating country, they were first verified andcleaned using the procedures outlined in Chapter11. Files containing the achievement data wereprepared and national-level Rasch and traditionaltest analyses were undertaken. The results ofthese analyses were included in the reports thatwere returned to each participant.

After processing at the national level, a set ofinternational-level analyses was undertaken.Some involved summarising national analyses,while others required an analysis of theinternational data set.

The final international cognitive data set (thatis, the data set of coded achievement bookletresponses) (available as intcogn.txt) consisted of174 896 students from 32 participating countries.Table 40 shows the total number of sampledstudents, broken down by participating countryand test booklet.

Test Targeting

Each of the domains was separately scaled to exam-ine the targeting of the tests. Figures 22, 23 and 24show the match between the international itemdifficulty distribution and the internationaldistribution of student achievement for each ofreading, mathematics and science, respectively.1

The figures consist of three panels. The firstpanel, Students, shows the distribution ofstudents’ Rasch-scaled achievement estimates.

Students at the top end of this distribution havehigher achievement estimates than students atthe lower end of the distribution. The secondpanel, Item Difficulties, shows the distributionof Rasch-estimated item difficulties, and thethird panel, Step Difficulties, shows the itemsteps.

The number of steps for an item reported inthe right-hand panel is the number of scorecategories minus 2, the highest step beingreported for all items in the middle panel of thefigure. That leaves one step to be reported in theright-hand panel for a partial credit item scored0, 1, 2, and two steps to be reported for a partialcredit item scored 0, 1, 2, 3. The item score foreach of these items is shown in the right-handpanel in the item label, separated from the itemnumber by a dot point.

In Figure 22, the student achievementdistribution, shown by Xs, is located a littlehigher than the item difficulty distribution. Thisimplies that, on average, the students in the PISAmain study had an ability level that was abovethe level needed to have a 50 per cent chance ofsolving the average item correctly.

Figure 23 shows a plot of item difficulties andstep parameters, together with the studentachievement distribution for mathematics. Thefigure shows that the selected items for the mainstudy, on average, have about a 50 per centchance of being solved correctly by the averagestudent in the tested population.

Chapter 13

1 The results reported here are based on a domain-by-domain unweighted scaling of the full internationaldata set.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

150

Table 40: Number of Sampled Students by Country and Booklet

Country Booklet TotalSE 1 2 3 4 5 6 7 8 9

Australia 566 567 587 593 584 578 581 560 560 5 176Austria 24 539 525 499 525 511 528 527 532 535 4 745

Belgium 136 735 726 747 720 739 713 727 719 708 6 670

Brazil 545 547 545 539 556 553 537 532 539 4 893

Canada 3 246 3 312 3 326 3 243 3 291 3 307 3 336 3 311 3 315 29 687

Czech Republic 159 571 582 594 584 575 570 563 588 579 5 365

Denmark 474 452 482 478 469 459 464 473 484 4 235

Finland 5 546 548 545 531 527 546 536 545 535 4 864

France 501 506 521 517 531 525 528 520 524 4 673

Germany 47 549 574 573 567 551 556 556 556 544 5 073

Greece 516 509 532 522 517 522 514 526 514 4 672

Hungary 162 520 520 521 522 529 529 517 531 536 4 887

Iceland 379 376 380 377 380 363 374 376 367 3 372

Ireland 420 417 430 441 429 427 441 426 423 3 854

Italy 550 555 555 553 551 549 562 555 554 4 984

Japan 585 582 586 584 590 585 581 578 585 5 256

Korea 558 546 556 555 552 553 559 553 550 4 982

Latvia 427 430 430 435 436 436 443 428 428 3 893

Liechtenstein 34 37 36 34 33 33 35 35 37 314

Luxembourg 377 407 388 383 382 340 507 363 381 3 528

Mexico 524 519 515 516 513 498 500 511 504 4 600

Netherlands 23 280 281 267 280 282 282 278 264 266 2 503

New Zealand 426 408 413 405 395 402 404 416 398 3 667

Norway 459 471 465 458 465 461 450 461 457 4 147

Poland 390 429 407 425 408 418 406 410 361 3 654

Portugal 516 508 499 502 500 512 518 518 512 4 585

Russian Federation 742 744 752 748 745 747 743 740 740 6 701

Spain 685 685 682 710 699 700 691 679 683 6 214

Sweden 484 486 496 498 508 484 484 487 489 4 416

Switzerland 677 654 672 677 663 681 692 692 692 6 100

United Kingdom 1 024 1 049 1 042 1 021 1 051 1 031 1 044 1 040 1 038 9 340

United States 441 418 430 429 421 439 425 435 408 3 846

Total 556 19 286 19 370 19 473 19 372 19 383 19 327 19 523 19 360 19 246 174 896

151

+item +item*step-------------------------------------------------------------------------------------| | | | X| | | 3 | | | X|50 | | X| | | X|19 | | XX| | | XX| | | XXX| | | 2 XXXX|68 118 | | XXXXX| | | XXXXXX|38 116 | | XXXXXX|27 129 | | XXXXXXX|62 64 75 | | XXXXXXX|3 43 94 | | XXXXXXX|67 79 80 | | 1 XXXXXXXX|63 82 86 90 | | XXXXXXXXXX|25 28 32 37 77 92 93 99 |25.1 77.1 | XXXXXXX|42 59 117 122 | | XXXXXXXX|2 7 15 17 22 23 30 123 |16.1 | XXXXXXX|13 20 55 60 109 113 120 125 |84.1 110.1 | XXXXXXX|5 16 26 52 83 85 89 97 107 | | XXXXXXX|51 84 100 101 108 114 127 |80.1 | 0 XXXXXXXX|1 8 10 11 41 45 57 74 87 |20.1 90.1 | XXXXX|21 31 33 103 110 119 121 |50.1 82.1 | XXXXXX|12 36 95 112 124 | | XXXX|18 24 44 96 104 126 |15.1 75.1 | XXXX|40 78 88 91 |42.1 | XXXX|49 76 98 106 111 128 | | -1 XXX|4 9 29 69 72 73 |68.1 | XXX|35 48 53 54 58 71 |108.1 | XX|61 66 | | XX|6 |43.1 | X|34 39 65 70 81 102 115 | | X|46 47 56 | | X|14 | | -2 X|105 | | X| | | | | | | | | | | | | | | | | | -3 | | | | | |======================================================================================Each 'X' represents 1125.1 cases======================================================================================

Figure 22: Comparison of Reading Item Difficulty and Achievement Distributions

+item +item*step--------------------------------------------------------------------------------------- | | | | | | 3 | | | X| | | | | | X| | | XX| | | X| | | 2 XX| | | XXX|9 21 30 | | XXXX| | | XXXX|15 | | XXXXX|6 8 26 28 | | XXXXXX|11 | | 1 XXXXX| | | XXXXXXX| | | XXXXXXX|5 | | XXXXXXX|2 13 29 |9.1 20.1 | XXXXXXXX| | | XXXXXXXXXX|18 |6.2 21.1 | 0 XXXXXXXX|31 | | XXXXXXXXX|7 22 |15.1 | XXXXXXXX|27 |6.1 28.1 | XXXXXXXX|4 14 20 |17.1 | XXXXXXXX|3 16 19 | | XXXXXXXXX|23 | | -1 XXXXXX|10 17 | | XXXXXXX| | | XXXX| | | XXXX|1 | | XXXXX|12 | | XXX|25 | | -2 XXX|24 | | XX| | | XX| | | X| | | X| | | X| | | -3 X| | | X| | | | | |=======================================================================================Each 'X' represents 1080.9 cases=======================================================================================

Figure 23: Comparison of Mathematics Item Difficulty and Achievement Distributions

Scal

ing

outc

omes

PIS

A 2

00

0 t

ech

nic

al r

epor

t

152

Figure 24 shows a plot of the item difficultiesand item step parameters, and the studentachievement distribution for science. The figureshows that the selected items for the main study,on average, have about a 50 per cent chance ofbeing answered correctly by the average studentin the tested population.

Test Reliability

A second test characteristic that is of importanceis the test reliability. Table 41 shows thereliability for each of the three overall scales(combined reading literacy, mathematical literacyand scientific literacy) before conditioning andbased upon three separate scalings. Thereliability for each domain and each country,after conditioning, is reported later.2

Figure 24: Comparison of Science Item Difficulty and Achievement Distributions

+item +item*step--------------------------------------------------------------------------------------- | | | | | | 3 | | | | | | | | | | | | X| | | | | | XX| | | 2 XX| | | X| | | XX| | | XXX| | | XXXX|8 10 | | XXXX| |14.1 | XXXXXX|3 | | 1 XXXXXX|14 | | XXXXX|24 | | XXXXXXX|25 33 | | XXXXXXX|7 |8.1 24.1 | XXXXXXX|19 29 32 | | XXXXXXXX|2 12 18 | | XXXXXXXXXX|5 13 21 | | 0 XXXXXXXX|9 | | XXXXXXXX| |2.1 | XXXXXXXX|23 26 30 | | XXXXXXXX|11 31 34 | | XXXXXXXX|1 4 6 17 | | XXXXXXXXX|15 | | XXXXXX| | | XXXXXXX|16 | | -1 XXXX| | | XXXX|22 28 | | XXXXX|20 | | XXX| | | XXX| | | XX| | | XX| | | -2 X| | | X| | | X| | | X| | | |27 | | | | | | | | -3 | | | | | |=======================================================================================Each 'X' represents 1061.1 cases=======================================================================================

2 The reliability index is an internal-consistency-likeindex estimated by the correlation betweenindependent plausible value draws.

Table 41: Reliability of the Three Domains BasedUpon Unconditioned Unidimensional Scaling

Scale Reliability

Mathematics 0.81

Reading 0.89

Science 0.78

Domain Inter-correlations

Correlations between the ability estimates forindividual students in each of the three domains,the so-called latent correlations, as estimated byConQuest (Wu, Adams and Wilson, 1997) aregiven in Table 42. It is important to note thatthese latent correlations are unbiased estimates ofthe true correlation between the underlying latentvariables. As such they are not attenuated by theunreliability of the measures and will generally behigher than the typical product momentcorrelations that have not been disattenuated forunreliability.

Scal

ing

outc

omes

153

Table 42: Latent Correlation Between the ThreeDomains

Scale Reading Science

Mathematics 0.819 0.846

Science 0.890

Reading Sub-scales

A five-dimensional scaling was performed on theachievement (cognitive) data, consisting of:• Scale 1: mathematics items (M)• Scale 2: reading items—retrieving information

(R1)• Scale 3: reading items—interpreting texts (R2)• Scale 4: reading items—reflection and

evaluation (R3)• Scale 5: science items (S).

Table 43 shows the latent correlationsbetween each pair of scales.

Table 43: Correlation Between Scales

Scale R1 R2 R3 S

M 0.854 0.836 0.766 0.846

R1 0.973 0.893 0.889

R2 0.929 0.890

R3 0.840

The correlations between the three readingsub-scales are quite high. The highest is betweenthe retrieving information and interpreting textssub-scales, and the lowest is between theretrieving information and reflection andevaluation sub-scales.

The correlations between the mathematicsscale and the other scales are a little above 0.80,except the correlation with the reflection andevaluation sub-scale, which is estimated to beabout 0.77.

The science scale correlates with themathematics scale at 0.85. It correlates at about0.89 with the first two reading sub-scales and at0.84 with the reflection and evaluation sub-scale.It appears that the science scale correlates morehighly with the reading sub-scales than with themathematics scale.

Item Name Countries

R040Q02 Greece, Mexico

R040Q06 Italy

R055Q03 Austria, Germany, Switzerland (Ger.)

R076Q03 England, Switzerland (Ger.)

R076Q04 England

R076Q05 Belgium (Fl.), Netherlands

R091Q05 Russian Federation, Switzerland (Ger.)

R091Q07B Sweden

R099Q04B Poland

R100Q05 Belgium (Fl.), Netherlands

R101Q08 Canada (Fr.)

R102Q04A Korea

R111Q06B Switzerland (Ger.)

R119Q04 Hungary

R216Q02 Korea

R219Q01T Italy

R227Q01 Spain

R236Q01 Iceland

R237Q03 Korea

R239Q02 Switzerland (Ger.)

R246Q02 Korea

M033Q01 Brazil

M155Q01 Japan, Switzerland (Ger.)

M155Q03 Austria, Switzerland (Ger.)

M155Q04 Switzerland (Ger.)

S133Q04T Austria, Germany, Switzerland (Ger.)

S268Q02T Iceland, NetherlandsS268Q06 Switzerland (Ital.)

Figure 25: Items Deleted for Particular Countries

Scaling OutcomesThe procedures for the national andinternational scaling are outlined in Chapter 9and need not be reiterated here.

National Item Deletions

The items were first scaled by country and theirfit was considered at the national level, as was theconsistency of the item parameter estimates acrosscountries. Consortium staff then adjudicated items,considering the items’ functioning both within andacross countries in detail. Those items consideredto be dodgy (see Chapter 9) were then reviewed in

PIS

A 2

00

0 t

ech

nic

al r

epor

t

154

consultation with National Project Managers(NPMs). The consultations resulted in the deletionof a few items at the national level. The deleteditems, listed in Figure 25, were recoded as notapplicable and were included in neither theinternational scaling nor in generating plausiblevalues.

International Scaling

The international scaling was performed on thecalibration data set of 13 500 students (500 randomly selected students from each of27 countries). The item parameter estimates fromthis scaling are reported in Appendices 1, 2 and 3.

Generating Student Scale Scores

Applying the conditioning approach described inChapter 9 and anchoring all of the itemparameters at the values obtained from theinternational scaling, weighted likelihoodestimates (WLEs) and plausible values weregenerated for all sampled students.3 Table 44gives the reliabilities at the international level forthe generated scale scores.

Table 44: Final Reliability of the PISA Scales

Scale Reliability

Mathematics 0.90

Reading (Combined) 0.93

Reading-R1 0.91

Reading-R2 0.92

Reading-R3 0.90

Science 0.90

Differential ItemFunctioning

In Rasch modelling, an item is considered to exhibitdifferential item functioning (DIF) if the responseprobabilities for that item cannot be fully explainedby the ability of the student and a fixed set of dif-

3 As described later in the chapter, a booklet effectwas identified during national scaling. This meant thatthe planned scaling procedures had to be modifiedand a booklet correction made. Procedures for thebooklet effect correction are described in a latersection of this chapter.

ficulty parameters for that item. In other words,the DIF analysis identifies items that are unusuallyeasy or difficult for one group of students relativeto another. The key is unusually. In DIF analysis,the aim is not to find items with a higher or alower p-value for one group than another—sincethis may result from genuine differences in theability levels of the two groups. Rather, theobjective is to identify items that appear to betoo difficult or too easy, after having controlledfor differences in the ability levels of the twogroups. The PISA main study uses Raschmethods to explore gender DIF.

When items are scaled with the Rasch model,the origin of the scale is arbitrarily set at zeroand student proficiency estimates are thenexpressed on the scale. A student with a score of0 has a probability equal to 0.50 of correctlyanswering an item with a Rasch difficultyestimate of 0. The same student has a probabilityhigher than 0.50 for items with negativedifficulty estimates and a probability lower than0.50 for items with positive difficulty estimates.

If two independent Rasch analyses are run onthe same item set, using the responses from a sub-sample consisting of females only and a sub-sampleof males only, the first analysis estimates the Raschitem difficulty that best fits for females (indepen-dently of males’ data), and the second analysis givesthe analogous estimate for the males. Two indepen-dent difficulty estimates are available. If the malesare better achievers than females on average, rel-ative item difficulty will not necessarily be affected.4

Figures 26, 27 and 28 show a scatter-plot ofthe difficulty parameter estimates for male andfemale students for the three domains respectively.Each item is represented by one point in theplots. The horizontal axis is the item difficultyestimate for females and the vertical axis is theitem difficulty estimate for males. All non-biaseditems are on the diagonal. The further the item isfrom the diagonal, the bigger the gender DIF.

Most of the items, regardless of the domain,show a statistically significant DIF, which is notsurprising, since the standard error of the itemparameters is inversely proportional to the samplesize. Most of the differences between the estimatesof item parameters respectively for males and fe-males are quite small. The graphical displays alsoshow that there are only a small number of outliers.

4 The DIF analysis was performed in one run of themulti-facet model of ConQuest. The Item by Genderinteraction term was added to the standard model.

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3Estimates for Girls

Est

imat

es f

or B

oys

Est

imat

es f

or B

oys

-3

-2

-1

0

1

2

3


-3

-2

-1

0

1

2

3


Est

imat

es f

or B

oys

Figure 26: Comparison of ItemParameter Estimates for Males andFemales in Reading

Figure 27: Comparison of ItemParameter Estimates for Males andFemales in Mathematics

Figure 28: Comparison of ItemParameter Estimates for Males andFemales in Science

Scal

ing

outc

omes

155

PIS

A 2

00

0 t

ech

nic

al r

epor

t

156

Test Length

After the field trial, NPMs expressed someconcern regarding test length. The field trial datawere analysed using three sources ofinformation: not-reached items; field trial timingvariables; and the last question in the field trialStudent Questionnaire, which asked students ifthey had had sufficient time to do the test and,if not, how many items they had had time to do.‘Session’ in the discussion below refers to one oftwo equal parts of the testing time (i.e., eachsession was intended to take one hour).

The results of these analyses were thefollowing:• Based on the pattern of missing and not-

reached items for both sessions, thereappeared to be no empirical evidence tosuggest that two 60-minute testing sessionswere too long. If they were, one would haveexpected that the tired students would havegiven up earlier in the second session than inthe first, but this was not the case.

• Fatigue may also have an effect on studentachievement. Unfortunately, the field trial itemallocation did not permit item anchoringbetween sessions 1 and 2 (except for a fewmathematics items) so that it was not possibleto differentiate any item difficulty effect orfatigue effects.

• The results of the various analyses suggestedthat the field trial instruments were a little toolong. The main determinant of test length (forreading) was the number of words thatneeded to be read. Analysis suggested thatabout 3 500 words per session should be setas an upper limit for the English version. Inthe field trial, test sessions where there weremore words resulted in a substantially largernumber of not-reached items.

• The mathematics and science sessions werealso a bit too long, but this may have beendue to the higher than desired difficulty forthese items in the field trial.The main study instruments were built with

the above information in mind, especiallyregarding the number of words in the test.

For the main study, missing responses for thelast items in each session were also recoded asnot reached. But, if the student was not entirelypresent for a session, the not-reached items wererecoded as not applicable. This means that these

missing data are not included in the resultspresented in this section and that these students’ability estimates were not affected by notresponding to these items when proficiencyscores were computed.

Table 45 shows the number of missing responsesand the number of missing responses recoded asnot reached, by booklet, according to these rules.Table 46 shows this information by country.

Table 45: Average Number of Not-Reached Itemsand Missing Items by Booklet and Testing Session

Session 1 Session 2Booklet Missing Not Missing Not

Reached Reached

0 5.68 1.59 NA NA

1 1.40 0.60 4.70 1.10

2 0.89 0.23 2.44 0.87

3 2.20 0.70 1.97 0.49

4 2.66 1.60 3.11 0.96

5 1.64 1.51 3.95 0.69

6 1.26 0.77 1.77 0.29

7 1.82 0.91 1.75 0.31

8 3.36 1.15 2.91 1.08

9 3.63 0.86 2.49 1.00

Total 2.10 0.93 2.79 0.76

Most of the means of not-reached items are be-low one in the main study, while more than 50 percent of these means were higher than one for thefield trial. As the aver-ages of not-reached itemsper session and per booklet are quite similar, itappears that test length was quite similar acrossbooklets for each session.

Like the field trial, the average number of not-reached items differs from one country toanother. It is worth noting that countries withhigher averages of not-reached items also havehigher averages of missing data.

Tables 47 and 48 provide the percentagedistribution of not-reached items per booklet andper session. The percentage of students who reachedthe last item ranges from 70 to 95 per cent forthe first session and from 76 to 95 per cent forthe second session (i.e., the percentages ofstudents with zero not-reached items).

Table 46: Average Number of Not-Reached Itemsand Missing Items by Country and Testing Session

Session 1 Session 2Country Missing Not Missing Not

Reached Reached

Australia 1.61 0.64 2.23 0.53Austria 2.09 0.53 2.79 0.41Belgium 1.85 0.88 2.53 0.95Brazil 4.16 5.26 5.31 3.98Canada 1.31 0.54 1.80 0.39Czech Republic 2.50 1.18 3.19 0.65Denmark 2.65 1.32 3.57 1.09Finland 1.39 0.46 1.96 0.41France 2.44 1.09 2.99 0.91Germany 2.69 0.98 3.41 0.66Greece 2.95 1.23 4.75 1.40Hungary 2.79 1.67 3.59 1.23Iceland 1.86 0.77 2.58 0.74Ireland 1.31 0.55 1.86 0.48Italy 3.12 1.43 4.14 1.21Japan 2.32 0.80 3.22 0.70Korea 1.28 0.22 1.88 0.29Latvia 3.36 1.56 4.42 1.98Liechtenstein 2.67 0.86 3.50 0.78Luxembourg 3.01 1.57 4.61 1.07Mexico 2.37 1.38 3.10 1.13Netherlands 0.46 0.18 0.70 0.12New Zealand 1.36 0.49 1.98 0.45Norway 2.26 0.69 3.23 0.88Poland 3.16 0.87 4.54 1.00Portugal 2.60 1.13 3.47 0.66Russian Federation 3.01 2.43 3.71 1.88Spain 2.08 1.31 2.93 1.17Sweden 2.08 0.75 2.81 0.73Switzerland 2.34 0.82 3.14 0.59United Kingdom 1.51 0.35 2.25 0.41United States 1.27 0.77 1.95 0.64

Table 47: Distribution of Not-Reached Items by Booklet, First Testing Session

No. of Not- BookletReached Items 0 1 2 3 4 5 6 7 8 9

0 70.1 86.7 95.1 87.4 70.9 79.5 85.9 83.7 81.9 84.11 13.4 2.7 0.3 0.7 3.0 1.2 1.0 2.7 3.9 3.42 0.7 2.3 0.9 1.6 4.8 1.6 0.8 3.9 1.3 1.03 1.3 2.2 0.3 3.9 3.3 4.4 1.7 2.5 0.9 1.84 1.3 1.1 0.7 0.2 2.7 1.0 4.7 0.4 1.8 1.45 1.7 2.2 0.4 0.9 3.8 0.5 1.6 0.5 0.7 2.36 0.7 0.2 1.8 0.4 1.5 1.0 0.6 0.4 0.8 1.5

> 6 10.8 2.6 0.5 4.9 10.0 10.8 3.7 5.9 8.7 4.5

Magnitude of BookletEffects

After scaling the PISA 2000 data for eachcountry separately, achievement scores formathematics, reading and science could becompared across countries and across booklets.(In these analyses, some sub-regions withincountries were considered as countries.)

Tables 49, 50 and 51 present student scalescores for the three domains, standardised tohave a mean of 10 and a standard deviation of 2for each domain and country combination. Thetable rows represent countries (or sub-regionswithin countries) and the columns representbooklets. The purpose of these analyses andtables is to examine the nature of any bookleteffects, not to show countries inorder countries are therefore not named.

The variation in the means between bookletsis greater than was expected. As the scaling wassupposed to equate the booklets and because thebooklets were systematically rotated withinschools, it was expected that the only between-booklet variance would be sampling variance.

The variations observed between booklets arequite stable, however, across countries, leaving apicture of systematically easier and harderbooklets. The booklet means over countries,shown in the last row of each table, indicatemean booklet differences up to 0.52 (one-quarterof a student standard deviation).5

Scal

ing

outc

omes

157

5 Since the Special Education booklet (Booklet 0) wasnot randomly assigned to students, the means wereexpected to be lower than 10.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

158

Table 48: Distribution of Not-Reached Items by Booklet, Second Testing Session

No. of Not- BookletReached Items 1 2 3 4 5 6 7 8 9

0 76.2 78.7 89.6 85.4 87.3 95.1 91.8 86.9 79.61 3.9 12.3 3.1 1.6 3.0 1.4 2.2 1.0 4.22 2.9 0.9 2.0 1.9 0.9 0.4 1.1 1.1 2.43 6.9 0.6 1.3 2.3 1.5 0.2 0.6 1.0 1.44 1.5 1.4 0.5 1.2 1.0 0.5 1.1 1.5 2.85 0.7 0.5 0.3 0.7 1.9 0.2 1.0 0.6 3.16 3.7 0.4 0.3 0.5 1.3 0.3 0.5 1.4 1.2

> 6 4.2 5.2 2.9 6.4 3.1 1.9 1.7 6.5 5.3

Note: Booklet 0 required only one hour to complete and is therefore not shown in this table.

Table 49: Mathematics Means for Each Bookletand Country

Booklet0 1 3 5 8 9 Mean

. 9.70 10.19 9.90 10.43 9.78 10.00

. 9.80 10.26 9.60 10.51 9.78 10.006.09 9.54 10.36 9.68 10.64 9.80 10.00. 9.48 10.33 9.88 10.67 9.64 10.00. 9.62 10.10 9.96 10.57 9.75 10.00. 9.70 10.20 9.75 10.48 9.86 10.00. 9.45 10.21 9.80 10.60 9.95 10.00. 9.60 10.28 9.76 10.59 9.78 10.00. 9.61 10.24 9.83 10.55 9.77 10.006.60 9.83 10.26 9.94 10.47 9.77 10.007.74 9.75 10.27 9.99 10.24 9.95 10.00. 9.73 10.11 9.79 10.56 9.81 10.00. 9.78 9.98 9.83 10.64 9.75 10.00. 9.47 10.07 10.25 10.55 9.64 10.00. 9.85 10.16 9.83 10.39 9.77 10.00. 9.81 9.98 9.63 10.63 9.93 10.00. 9.82 10.12 9.82 10.30 9.94 10.006.79 9.79 10.38 9.88 10.65 10.01 10.00. 9.58 10.29 9.87 10.41 9.85 10.00. 9.74 10.26 9.79 10.30 9.91 10.00. 9.66 10.13 9.79 10.45 9.98 10.00. 9.59 10.39 9.74 10.58 9.70 10.009.48 10.25 9.85 9.93 11.03 9.80 10.007.51 9.90 10.48 9.82 10.61 9.85 10.00. 9.79 10.35 9.75 10.52 9.60 10.00. 9.81 9.98 9.85 10.72 9.68 10.00. 9.72 10.18 9.85 10.30 9.94 10.00. 9.69 10.32 9.85 10.44 9.62 10.00. 9.52 10.41 9.93 10.54 9.61 10.008.59 9.72 10.39 10.01 10.67 9.63 10.00. 9.60 10.27 9.75 10.49 9.89 10.006.36 9.78 10.20 9.87 10.38 9.95 10.007.47 9.72 10.20 10.01 10.59 9.80 10.008.73 9.68 10.23 9.85 10.52 9.81 10.007.54 9.70 10.20 9.86 10.52 9.78Note: Scaled to a mean of 10 and standard deviation of 2.

Each row represents a country or a subnational community that was treated independently in the study.

These differences would affect the abilityestimates of the students who worked on easieror harder booklets, and therefore booklet effectsneeded to be explained and corrected for in anappropriate way in subsequent scaling.

Exploratory analyses indicated that an ordereffect that was adversely affecting the between-booklet equating was the most likely explanationfor this booklet effect. This finding is illustratedin Figure 29, which compares item parameterestimates for cluster 1 obtained from threeseparate scalings—one for each booklet to whichcluster 1 was allocated. For the purposes of thiscomparison, the item parameters for readingwere re-estimated per booklet on the calibrationsample of 13 500 students, (about 1 500 studentsper booklet). To be able to compare the itemparameter estimates, they were centred on cases,assuming no differences in ability across bookletsin the population level. Figure 29 therefore showsthe deviations of each item from the mean para-meter for that item over all booklets in which itappeared. Each item of clusters 1 to 8 was admin-istered in three different booklets, and the itemsin cluster 9 were administered in two differentbooklets. For each item, the differences between thethree (two) estimates and their mean is displayedand grouped into one figure per cluster. The re-maining figures are included in Appendix 6.

Except for the first four items, the items incluster 1 were easier in Booklet 1 than in Booklets 5and 7. The standard errors, not printed in the fig-ures, ranged from about 0.04 to 0.08 for most items.

The legend indicates the positions of thecluster in the booklets: cluster 1 is the first clusterin the first half of Booklet 1 (Pos. 1.1), the

159

Table 50: Reading Means for Each Booklet and Country

Booklet0 1 2 3 4 5 6 7 8 9 Mean

. 10.07 10.43 10.15 9.63 10.04 10.32 9.99 9.64 9.73 10.00

. 10.48 10.18 10.09 9.64 10.02 9.95 9.84 9.68 10.13 10.00

4.00 10.05 10.24 10.13 9.72 10.08 10.11 9.96 9.88 9.87 10.00

. 9.89 10.22 10.22 9.73 10.04 10.14 10.22 9.80 9.74 10.00

. 10.13 10.41 10.12 9.75 10.01 10.17 9.89 9.73 9.78 10.00

. 9.95 10.22 10.01 9.88 9.89 10.15 10.06 9.87 9.97 10.00

. 10.06 10.44 10.11 9.66 9.98 10.27 9.91 9.53 10.03 10.00

. 9.85 10.26 10.21 9.96 10.13 10.11 10.07 9.70 9.71 10.00

. 9.98 10.38 10.14 9.81 10.09 10.05 9.98 9.71 9.88 10.00

4.85 10.18 10.38 10.08 9.82 10.03 10.01 10.14 9.83 9.96 10.00

7.10 10.07 10.07 10.17 9.84 10.11 10.23 9.96 9.81 9.96 10.00

. 10.15 10.31 10.21 9.79 9.99 10.09 10.13 9.52 9.81 10.00

. 10.26 10.36 10.26 9.86 9.90 10.03 9.77 9.75 9.82 10.00

. 9.98 10.44 9.94 9.65 9.98 10.13 10.10 9.90 9.89 10.00

. 10.23 10.21 9.99 9.78 9.89 10.20 9.75 9.92 10.02 10.00

. 10.07 10.39 10.13 9.83 9.86 10.20 9.94 9.84 9.70 10.00

. 10.13 10.23 10.03 9.82 9.92 10.07 10.02 9.80 9.97 10.00

5.94 9.98 10.16 10.00 9.78 10.00 10.25 9.97 10.35 10.42 10.00

. 10.10 10.31 10.10 9.72 9.99 10.07 9.93 9.77 10.00 10.00

. 10.01 10.34 10.14 9.88 10.06 10.00 9.96 9.66 9.93 10.00

. 10.25 10.47 9.87 9.64 9.80 10.18 9.94 9.95 9.91 10.00

. 10.09 10.46 10.08 9.61 9.99 10.08 9.98 9.71 10.00 10.00

8.09 10.35 10.70 10.38 10.07 10.13 10.50 10.39 10.20 10.33 10.00

5.86 10.20 10.45 10.21 9.78 10.10 10.25 10.04 10.00 10.11 10.00

. 10.24 10.58 10.07 9.63 9.94 9.99 9.92 9.51 10.14 10.00

. 10.00 10.50 10.00 9.80 9.99 10.20 9.95 9.84 9.69 10.00

. 10.21 10.51 10.00 9.67 10.01 10.01 10.09 9.71 9.81 10.00

. 10.23 10.42 10.14 9.61 9.98 10.21 10.03 9.56 9.95 10.00

. 9.96 10.29 10.22 9.79 10.00 10.17 10.04 9.68 9.87 10.00

7.95 10.07 10.39 10.24 9.78 10.04 10.28 10.08 9.68 10.09 10.00

. 9.99 10.29 10.12 9.82 9.93 10.15 9.98 9.89 9.82 10.00

5.97 10.12 10.33 10.08 9.92 10.15 10.19 9.85 9.70 9.84 10.00

7.18 10.06 10.42 10.16 9.87 10.05 10.27 10.04 9.82 9.66 10.00

7.48 10.10 10.36 10.11 9.77 10.00 10.14 9.99 9.78 9.94 10.00

6.44 10.11 10.34 10.11 9.78 10.01 10.15 9.99 9.78 9.90

Note: Scaled to a mean of 10 and standard deviation of 2.

Scal

ing

outc

omes

PIS

A 2

00

0 t

ech

nic

al r

epor

t

160

described by a booklet effect, since they vary notonly with respect to the booklets but differentially.

Nevertheless, there is evidence of an ordereffect of some kind. The parameters from thefirst position in the first half are lower than theparameters for other positions for almost everyitem, although the differences are occasionallynegligible. The relation of the other positions(for clusters 1 to 8) is less clear. For clusters 2,4, 5 and 6, the difficulties of position 1.1 and 2.1look similar, whereas position 1.2 gives higheritem parameter estimates. For clusters 1 and 7,positions 1.2 and 2.1 show a similar difficultyand are both higher than in position 1.1. Strangely,for cluster 3 position 1.2 is similar to position 1.1,and position 2.1 looks much more difficult. Thedifferences in clusters 8 and 9 may well beexplained by an order effect. Interestingly, afterhaving some mathematics and science items,position 2.2 looks much harder than after havingreading items only (Booklet 8).

Correction of the Booklet Effect

Modelling the effect

Modelling the order effects in terms of itempositions in a booklet or at least in terms ofcluster positions in a booklet would result in avery complex model. For the sake of handling inthe international scaling, the effect was modelledat booklet level.

Booklet effects were included in themeasurement model to prevent confounding itemdifficulties and booklet effects. For the ConQuestmodel statement, the calibration model was:

item + item*step + booklet.The booklet parameter, formally defined in the

same way as item parameters, reflects bookletdifficulty.

This measurement model provides item para-meter estimates that are not affected by the bookletdifficulties and booklet difficulty parameters thatcan be used to correct student ability scores.

Table 51: Science Means for Each Booklet andCountry

Booklet0 2 4 6 8 9 Mean

. 9.65 9.89 10.29 9.67 10.52 10.00

. 9.66 9.67 10.44 9.39 10.79 10.007.72 9.65 9.92 10.43 9.68 10.34 10.00. 9.68 9.72 10.40 9.58 10.62 10.00. 9.50 9.85 10.34 9.82 10.50 10.00. 9.60 9.89 10.32 9.82 10.38 10.00. 9.50 9.77 10.29 9.93 10.50 10.00. 9.79 9.86 10.21 9.68 10.47 10.00. 9.76 9.78 10.28 9.76 10.40 10.007.07 9.89 9.95 10.23 9.88 10.31 10.008.08 9.64 9.84 10.34 10.07 10.28 10.00. 9.84 9.76 10.13 9.80 10.46 10.00. 9.53 9.59 10.08 10.03 10.78 10.00. 9.44 9.94 10.47 9.68 10.45 10.00. 9.77 9.83 10.25 9.74 10.41 10.00. 9.65 9.79 10.08 10.09 10.47 10.00. 9.85 9.77 10.14 9.94 10.30 10.006.87 9.98 9.91 10.43 10.00 10.39 10.00. 9.69 9.83 10.12 9.80 10.56 10.00. 9.68 10.00 10.12 9.66 10.54 10.00. 9.72 9.67 10.22 10.16 10.22 10.00. 9.93 9.64 10.28 9.77 10.39 10.009.55 9.73 10.03 10.35 9.60 11.01 10.007.39 9.87 9.95 10.47 9.80 10.64 10.00. 9.84 9.68 10.22 9.63 10.64 10.00. 9.57 9.80 10.21 10.03 10.45 10.00. 9.91 9.82 10.18 9.84 10.25 10.00. 9.55 9.81 10.41 9.53 10.70 10.00. 9.64 10.06 10.41 9.45 10.45 10.008.54 9.81 9.74 10.42 9.75 10.71 10.00. 9.57 9.95 10.16 9.92 10.40 10.006.54 9.86 9.71 10.36 9.92 10.30 10.006.70 9.77 9.91 10.15 10.07 10.54 10.008.77 9.72 9.83 10.25 9.81 10.49 10.007.72 9.70 9.82 10.26 9.80 10.47

Note: Scaled to a mean of 10 and standard deviation of 2.

second cluster in the first half in Booklet 5 (Pos1.2), and the first cluster in the second half ofBooklet 7 (Pos 2.1). Items are listed in the orderin which they were administered.

The differences in item difficulties over allnine clusters (see Appendix 6) cannot simply be

Scal

ing

outc

omes

161

Estimating the parametersThe calibration model given above was used toestimate the international item parameters. It wasestimated using the international calibration sampleof 13 500 students, and not-reached items in theestimation were treated as not administered.

The booklet parameters obtained from thisanalysis were not used to correct for the bookleteffect. Instead, a set of booklet parameters wasobtained by scaling the whole international dataset using booklet as a conditioning variable andapplying a specific weight which gives eachOECD country6 an equal contribution7. Further,in this analysis, not-reached items were regardedas incorrect responses, in accordance with thecomputation of student ability scores reported inChapter 9. The students who responded to theSE booklet were excluded from the estimation. Thebooklet parameter estimates obtained are reportedin Table 52. Note that the table reports the devi-ation per booklet from the average within each ofthe three domains over all booklets and that there-fore these parameters add up to zero per domain.

In order to display the magnitude of theeffects, the booklet difficulty parameters arepresented again in Table 53, after theirtransformation into the PISA reporting scalewhich has a mean of 500 and SD of 100.

Table 52: Booklet Difficulty Parameters

Booklet Mathematics Reading Science

1 0.23 -0.05 -

2 - -0.23 0.18

3 -0.14 -0.07 -

4 - 0.12 0.10

5 0.13 -0.01 -

6 - -0.11 -0.13

7 - 0.01 -

8 -0.35 0.20 0.10

9 0.13 0.13 -0.24

Table 53: Booklet Difficulty Parameters Reportedon the PISA Scale

Booklet Mathematics Reading Science

1 17.6 -4.5 -

2 - -20.9 16.2

3 -10.7 -6.4 -

4 - 10.9 9.0

5 9.9 -0.9 -

6 - -10.0 -11.7

7 - 0.9 -

8 -26.7 12.8 9.0

9 9.9 11.8 -21.6

Figure 29: Item Parameter Differences for the Items in Cluster One

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Item

Par

amet

er D

iffe

renc

es

Pos. 1.1 (book 1) Pos 1.2 (book 5) Pos. 2.1 (book 7)

Item

6 Due to the decision to omit the Netherlands fromreports that focus on variation in achievement scores,the scores were not represented in this sample.7 These weights are computed to achieve an equalcontribution from each country. They can be based onany weights used in the countries (e.g., to adjust forthe sampling design) or unweighted data.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

162

Applying the correctionTo correct the student scores for the bookleteffects, two alternatives were considered:• to correct all students’ scores using one set of

the internationally estimated bookletparameters; and

• to correct the students’ scores using one set ofnationally estimated booklet parameters foreach country. These coefficients can beobtained as booklet regression coefficientsfrom estimating the conditioning models, onecountry at a time.All available sets of booklet parameters sum

to zero. Applying these parameters to everystudent in a country (having an equaldistribution of Booklets 1 to 9 in each country)does not change the mean country performance .Therefore the application of any correction doesnot change the country ranking.

With the international correction, all countriesare treated in exactly the same way, and it istherefore the most desirable option from atheoretical point of view. This is the option thatwas implemented.

From a statistical point of view, it leaves someslight deviations from equal booklet difficultiesin the data. These deviations vary over countries.Using the national correction, these deviationswould become negligible.

Outcomes of MarkerReliability Studies

This chapter reports the result of the variousmarker reliability studies that were implemented.The methodologies for these studies aredescribed in Chapter 10.

within-countryReliability Studies

Norman Verhelst

Variance Components Analysis

Tables 54 to 58 show the results of the variancecomponents analysis for the multiply-markeditems in mathematics, science and the threereading sub-scales (usually referred to as scalesfor convenience), respectively. The variancecomponents are each expressed as a percentageof their sum.

The tables show that those variancecomponents associated with markers (thoseinvolving γ) are remarkably small relative to theother components. This means that there are nosignificant systematic within-country markereffects

As discussed in Chapter 10, analyses of thetype reported here can result in negative varianceestimates. If the amount by which thecomponent is negative is small, then this is a signthat the variance component is negligible (nearzero). If the component is large and negative,then it is a sign that the analysis method isinappropriate for the data. In Tables 54 to 58,countries with large inadmissible -estimatesare listed at the bottom of the tables and aremarked with an asterisk. (Some sub-regionswithin countries were considered as countries for

these analyses.) For Poland, some of theestimates were so highly negative that theresulting numbers did not fit in the format of thetables. Therefore Poland has been omittedaltogether.

Generalisability Coefficients

The generalisability coefficients are computedfrom the variance components using:

.(88)

They provide an index of reliability for themultiple marking in each country.

I denotes the number of items and M thenumber of markers. By using different values forI and M, one obtains a generalisation of theSpearman-Brown formula for test-lengthening.In Tables 59 to 63, the formula is evaluated forthe six combinations of I = 8, 16, 24 and M =1, 2, using the variance component estimatesfrom the corresponding tables presented above.For the countries marked with ‘*’ in the abovetables, no values are displayed, because they falloutside the acceptable (0,1) range.

ρσ

σ

σσ σ σ σ σ

ααβ

ααβ γ αγ βγ ε

3

22

22 2 2 2 2( , ).. ..

'Y Y I

I M I M

v v =+

+ ++

++

×

ρ3

Chapter 14

PIS

A 2

00

0 t

ech

nic

al r

epor

t

164

Table 54: Variance Components for Mathematics


Student Item Marker Interaction Interaction Interaction ErrorComponent Component Component Component Component ComponentComponent

Country ( ) ( ) ( ) ( ) ( ) ( ) ( )

Australia 24.99 28.24 0.00 40.54 0.12 0.04 6.07

Austria 15.17 30.81 0.22 45.14 0.17 1.80 6.68

Belgium (Fl.) 23.29 24.86 0.02 46.59 0.06 0.04 5.15

Czech Republic 22.67 27.22 0.01 45.10 -0.02 0.07 4.95

Denmark 19.78 32.69 0.00 42.34 -0.01 0.07 5.14

England 27.49 31.12 0.00 38.10 0.07 0.03 3.20

Finland 15.93 29.44 0.01 51.89 0.00 0.02 2.70

France 18.11 31.27 0.01 43.66 0.08 0.13 6.73

Germany 20.92 26.36 0.00 46.39 -0.01 0.31 6.03

Greece 20.40 23.05 0.02 52.61 0.02 0.04 3.86

Hungary 24.52 23.97 0.01 47.36 -0.30 0.04 4.40

Iceland 22.35 29.47 0.03 43.97 0.10 0.03 4.05

Ireland 22.46 26.31 0.11 41.85 0.18 0.20 8.90

Italy 17.75 24.60 0.01 51.74 0.07 0.14 5.69

Japan 21.35 24.83 0.00 51.90 -0.06 0.02 1.97

Korea 20.32 29.04 -0.01 45.43 0.00 0.14 5.08

Mexico 14.06 19.58 -0.03 54.60 0.70 0.13 10.96

Netherlands 23.66 26.45 0.09 42.09 0.05 0.16 7.49

New Zealand 20.99 29.39 0.03 43.43 0.10 0.16 5.90

Norway 18.66 32.03 0.00 44.55 0.13 0.07 4.56

Portugal 18.56 31.90 -0.01 47.29 0.08 0.05 2.13

Russian Federation 24.39 24.19 0.03 47.17 0.02 0.01 4.18

Spain 19.84 28.64 0.10 44.63 0.19 0.09 6.50

Sweden 21.24 29.35 0.01 44.44 0.08 0.09 4.80

Switzerland (Fr.) 20.80 25.83 0.09 43.71 0.24 0.34 8.99

Switzerland (Ger.) 21.60 28.30 0.00 47.29 0.06 0.01 2.75

Switzerland (Ital.) 19.25 29.62 0.12 41.26 0.82 0.12 8.81

Belgium (Fr.)* 32.85 24.41 0.50 38.04 -29.66 -0.24 34.09

Brazil* 32.59 6.06 0.78 46.99 -32.39 -1.14 47.12

Latvia* 20.29 28.87 0.02 42.83 -1.81 0.26 9.54

Luxembourg* 17.45 23.87 0.03 47.06 -1.79 0.32 13.07

Scotland* 19.66 29.49 0.23 38.00 -3.23 2.10 13.76

United States* 37.21 25.25 1.14 31.21 -42.32 -1.53 49.04

* -coefficients were inadmissible (>1) in these countries.ρ3

σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

165

Table 55: Variance Components for Science



Country ( ) ( ) ( ) ( ) ( ) ( ) ( )

Australia 26.38 11.16 -0.12 46.30 0.06 0.88 15.34

Austria 20.47 9.98 0.47 54.53 0.02 0.58 13.95

Belgium (Fl.) 18.30 17.01 0.05 54.99 -0.14 0.11 9.68

Czech Republic 23.13 10.46 0.01 53.53 0.17 0.35 12.34

Denmark 21.77 10.39 0.01 59.94 -0.02 0.10 7.82

England 19.54 9.03 0.23 55.23 0.29 0.86 14.82

Finland 20.49 13.26 0.04 51.44 0.10 0.27 14.41

France 20.47 13.80 0.00 60.20 0.12 0.04 5.37

Germany 22.23 11.72 0.09 52.88 -0.05 0.24 12.89

Greece 21.29 9.42 0.01 60.84 0.27 0.07 8.11

Hungary 22.42 10.11 0.05 49.84 -0.12 0.78 16.92

Iceland 20.17 8.63 0.15 61.15 0.08 0.06 9.76

Ireland 22.56 14.26 0.06 56.78 0.01 0.05 6.29

Italy 20.82 8.77 0.05 65.03 -0.04 0.04 5.33

Japan 19.11 12.01 0.01 65.77 0.07 0.12 2.91

Korea 25.40 10.13 0.07 52.04 0.06 0.17 12.13

Mexico 19.49 14.20 0.05 57.88 0.19 0.08 8.11

Netherlands 23.04 8.57 0.07 56.11 0.05 0.16 12.00

New Zealand 20.05 10.01 -0.02 59.56 0.12 0.22 10.05

Norway 22.20 8.16 0.00 65.35 -0.02 0.01 4.30

Portugal 22.99 12.26 0.02 57.47 0.10 0.08 7.08

Spain 21.12 8.36 0.01 64.99 0.03 0.05 5.44

Switzerland (Fr.) 24.05 7.11 0.06 48.92 0.14 0.66 19.07

Switzerland (Ger.) 22.93 8.67 0.35 55.60 0.14 0.42 11.90


Belgium (Fr.)* 24.13 10.15 0.06 57.55 -0.80 0.22 8.68

Brazil* 21.47 7.49 0.06 58.48 -1.18 0.10 13.57

Latvia* 16.74 7.33 0.22 55.63 -1.30 0.58 20.81

Luxembourg* 25.40 9.63 0.17 50.02 -1.55 0.79 15.54

Russian Federation* 31.16 8.83 1.61 47.67 -23.34 0.83 33.24

Scotland* 25.89 9.64 0.14 54.59 -3.48 0.00 13.22

Sweden* 21.12 6.25 0.92 51.20 -28.85 -0.12 49.49

United States* 34.29 6.96 0.73 47.51 -30.26 -1.37 42.13


σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

Table 56: Variance Components for Retrieving InformationStudent- Student- Item- Measure-

Item Marker Marker mentStudent Item Marker Interaction Interaction Interaction Error

Component Component Component Component Component ComponentComponent

Country ( ) ( ) ( ) ( ) ( ) ( ) ( )

Australia 22.40 19.01 -0.02 50.36 0.01 0.22 8.01

Austria 14.97 23.60 0.00 54.93 -0.12 0.24 6.39

Belgium (Fl.) 11.89 22.59 0.15 50.42 0.10 1.43 13.42

Czech Republic 17.14 19.81 0.03 53.73 -0.23 0.46 9.05

Denmark 13.24 24.56 0.01 54.22 0.16 0.25 7.56

England 14.79 22.14 0.00 59.71 0.01 0.00 3.35

Finland 18.97 18.30 0.02 55.93 -0.11 0.07 6.81

France 19.52 19.37 0.37 52.32 1.01 0.11 7.30

Germany 19.28 21.95 0.00 54.03 -0.09 0.03 4.80

Greece 17.92 15.78 0.04 56.92 -0.26 0.16 9.45

Hungary 16.69 17.65 0.02 58.80 -0.01 0.19 6.65

Iceland 16.42 18.09 0.00 62.69 0.07 0.03 2.71

Ireland 16.62 20.04 0.08 53.30 0.08 0.88 9.01

Italy 9.22 22.47 0.04 60.33 -0.10 0.18 7.86

Japan 17.93 15.22 0.19 51.28 -0.08 0.54 14.92

Korea 11.45 23.50 0.02 55.59 -0.13 0.18 9.39

Mexico 20.34 18.28 0.07 54.53 -0.02 0.04 6.76

Netherlands 17.98 20.14 0.00 52.89 -0.81 0.26 9.54

New Zealand 20.29 19.62 0.04 51.82 0.09 0.19 7.94

Norway 15.66 17.79 0.00 61.43 0.21 0.17 4.74

Portugal 14.03 19.05 0.01 61.34 -0.03 0.12 5.48

Spain 18.43 15.34 0.56 42.02 0.10 2.62 20.93

Switzerland (Fr.) 17.17 20.42 0.12 52.69 0.04 0.30 9.25

Switzerland (Ger.) 16.43 23.52 0.02 50.02 0.00 1.32 8.69


Belgium (Fr.)* 35.88 21.17 2.94 38.23 -64.09 -1.95 67.83

Brazil* 21.59 19.19 1.16 49.15 -29.54 -0.37 38.82

Latvia* 27.99 9.17 1.12 51.86 -4.05 0.56 13.34

Luxembourg* 30.64 20.92 2.41 45.20 -57.98 -2.14 60.96


Scotland* 16.70 18.53 0.10 58.15 -1.41 0.06 7.87

Sweden* 14.45 23.71 0.24 53.71 -10.53 0.41 18.00

United States* 29.73 21.35 1.09 42.19 -35.61 -1.23 42.48


σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

166

Table 57: Variance Components for Interpreting TextsStudent- Student- Item- Measure-

Item Marker Marker mentStudent Item Marker Interaction Interaction Interaction Error

Component Component Component Component Component ComponentComponent

Country ( ) ( ) ( ) ( ) ( ) ( ) ( )

Australia 16.46 22.63 0.10 47.56 0.09 0.35 12.80

Austria 15.57 16.43 0.09 52.88 -0.06 1.15 13.94

Belgium (Fl.) 14.90 23.80 0.04 45.18 0.09 0.81 15.18

Czech Republic 16.40 17.76 0.14 49.36 -0.01 1.18 15.16

Denmark 16.88 20.32 -0.02 54.40 0.00 0.28 8.15

England 11.41 27.78 0.02 54.11 -0.08 0.15 6.61

Finland 19.20 18.21 0.04 52.20 0.28 0.43 9.64

France 14.50 20.11 -0.02 53.75 -0.28 0.38 11.55

Germany 15.91 22.73 0.02 50.74 -0.22 0.27 10.55

Greece 13.80 22.63 -0.03 49.56 0.16 0.41 13.47

Hungary 15.62 16.96 0.16 54.81 -0.05 0.73 11.77

Iceland 14.56 22.13 0.05 56.53 -0.14 0.19 6.67

Ireland 12.95 19.45 0.21 50.44 0.07 1.41 15.47

Italy 9.94 22.70 0.07 57.23 -0.12 0.19 9.98

Japan 14.18 9.27 0.48 54.47 0.50 1.03 20.07

Korea 15.03 23.66 0.11 45.35 0.26 0.58 15.01

Mexico 16.29 22.00 0.05 51.38 0.13 0.18 9.97

Netherlands 19.30 15.15 0.02 52.76 -0.42 0.55 12.64

New Zealand 17.61 21.74 0.05 49.03 0.10 0.34 11.13

Norway 13.37 29.90 0.00 47.77 0.04 0.38 8.54

Portugal 13.00 22.16 0.04 56.14 0.23 0.16 8.27

Spain 16.07 16.36 0.83 41.68 0.56 1.42 23.09

Switzerland (Fr.) 15.65 18.91 0.04 50.29 0.22 0.71 14.18

Switzerland (Ger.) 17.49 17.64 0.10 49.75 -0.11 0.68 14.46


Belgium (Fr.)* 35.47 16.09 2.81 41.04 -62.53 -2.78 69.90

Brazil* 20.03 26.28 0.99 40.64 -29.96 0.26 41.77

Luxembourg* 31.46 17.93 2.84 44.04 -53.26 -1.79 58.78

Latvia* 23.28 14.08 0.46 45.50 -4.65 2.95 18.37


Scotland* 14.04 17.09 0.12 57.51 -1.24 0.17 12.30

Sweden* 18.03 13.33 0.14 56.29 -10.16 -0.24 22.60

United States* 25.67 19.30 1.29 43.64 -36.52 -1.22 47.84


σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

167

168

Table 58: Variance Components for Reflection and Evaluation



Country ( ) ( ) ( ) ( ) ( ) ( ) ( )

Australia 15.53 25.75 0.41 35.55 0.31 0.75 21.69

Austria 14.90 19.96 0.81 37.20 0.57 1.44 25.13

Belgium (Fl.) 16.26 20.75 0.35 36.20 0.41 1.26 24.77

Czech Republic 15.65 20.78 1.07 32.83 0.69 1.96 27.02

Denmark 14.83 21.54 0.07 49.36 0.05 0.53 13.61

England 13.77 25.57 0.05 47.06 0.02 0.13 13.41

Finland 23.09 15.18 0.21 48.51 0.24 0.27 12.50

France 16.93 21.29 0.40 40.52 0.56 0.52 19.78

Germany 18.32 25.33 0.19 37.09 0.06 0.65 18.35

Greece 13.75 26.20 0.52 32.21 0.41 1.34 25.57

Hungary 15.60 20.52 0.66 39.65 0.16 1.25 22.15

Iceland 16.64 22.44 0.07 49.23 0.01 0.40 11.20

Ireland 14.24 18.67 0.96 41.43 0.34 1.51 22.85

Italy 10.64 26.50 0.50 42.79 0.57 0.99 18.02

Japan 15.13 10.34 1.57 36.68 1.76 1.79 32.73

Korea 11.70 25.36 0.68 29.10 0.78 2.57 29.82

Mexico 18.52 18.75 0.14 40.50 -0.74 0.88 21.95

Netherlands 17.91 24.33 0.37 37.62 0.05 0.45 19.26

New Zealand 14.06 21.79 0.57 43.90 0.50 0.80 18.37

Norway 16.67 21.93 0.28 37.37 -0.23 1.11 22.86

Portugal 18.63 24.58 0.24 34.85 0.50 1.12 20.08

Russian Federation 13.18 25.07 0.07 41.51 0.27 0.59 19.31

Spain 14.97 16.35 1.02 37.29 1.03 1.85 27.50

Switzerland (Fr.) 17.57 19.11 0.77 35.08 0.47 1.86 25.14

Switzerland (Ger.) 17.60 19.74 0.77 37.12 0.61 1.14 23.02


Belgium (Fr.)* 16.39 17.60 1.03 47.94 -7.09 0.58 23.55

Brazil* 32.02 20.54 2.15 36.32 -45.99 -1.72 56.68

Latvia* 27.95 17.69 2.58 42.76 -40.51 -1.18 50.71

Luxembourg* 21.17 17.66 1.16 34.52 -22.78 0.92 47.35

Scotland* 25.18 14.56 1.02 37.26 -2.91 1.08 23.81

Sweden* 21.58 20.05 1.21 35.26 -18.97 1.14 39.73

United States* 24.39 26.13 0.97 31.25 -28.29 -0.71 46.26


σε2σ βγ

2σαγ2σαβ

2σ γ2σ β

2σα2

PIS

A 2

00

0 t

ech

nic

al r

epor

t

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

169

Table 59: -Estimates for Mathematics

Country I = 8 I = 16 I = 24M = 1 M = 2 M = 1 M = 2 M = 1 M = 2

Australia 0.971 0.986 0.982 0.991 0.986 0.993

Austria 0.935 0.966 0.951 0.975 0.958 0.979

Belgium (Fl.) 0.976 0.988 0.985 0.992 0.988 0.994

Czech Republic 0.978 0.989 0.987 0.994 0.991 0.996

Denmark 0.975 0.987 0.986 0.993 0.990 0.995

England 0.986 0.993 0.991 0.995 0.993 0.996

Finland 0.985 0.992 0.991 0.995 0.993 0.997

France 0.961 0.980 0.976 0.988 0.981 0.991

Germany 0.971 0.985 0.984 0.992 0.989 0.994

Greece 0.981 0.990 0.988 0.994 0.991 0.996

Hungary 0.982 0.991 0.990 0.995 0.993 0.996

Ireland 0.951 0.975 0.967 0.983 0.973 0.986

Iceland 0.978 0.989 0.985 0.992 0.988 0.994

Italy 0.968 0.984 0.979 0.990 0.984 0.992

Japan 0.991 0.996 0.995 0.997 0.996 0.998

Korea 0.976 0.988 0.986 0.993 0.990 0.995

Mexico 0.909 0.952 0.926 0.962 0.934 0.966

Netherlands 0.963 0.981 0.977 0.988 0.982 0.991

New Zealand 0.967 0.983 0.979 0.989 0.984 0.992

Norway 0.972 0.986 0.981 0.990 0.985 0.992

Portugal 0.986 0.993 0.990 0.995 0.992 0.996

Russian Federation 0.981 0.991 0.989 0.994 0.992 0.996

Spain 0.958 0.979 0.970 0.985 0.975 0.987

Sweden 0.974 0.987 0.984 0.992 0.987 0.994

Switzerland (Fr.) 0.946 0.972 0.963 0.981 0.969 0.984

Switzerland (Ger.) 0.985 0.993 0.991 0.995 0.993 0.996Switzerland (Ital.) 0.922 0.960 0.936 0.967 0.941 0.970

ρ3

PIS

A 2

00

0 t

ech

nic

al r

epor

t

170

Table 60: -Estimates for Science


Australia 0.939 0.969 0.965 0.982 0.975 0.987

Austria 0.922 0.959 0.945 0.972 0.954 0.976

Belgium (Fl.) 0.952 0.975 0.970 0.985 0.978 0.989

Switzerland (Fr.) 0.917 0.957 0.948 0.973 0.961 0.980

Switzerland (Ger.) 0.919 0.958 0.950 0.974 0.962 0.981

Czech Republic 0.936 0.967 0.954 0.977 0.962 0.981

Denmark 0.943 0.971 0.966 0.982 0.975 0.987

England 0.967 0.983 0.981 0.990 0.986 0.993

Finland 0.976 0.988 0.985 0.992 0.989 0.994

France 0.932 0.965 0.957 0.978 0.968 0.984

Germany 0.944 0.971 0.965 0.982 0.973 0.986

Greece 0.972 0.986 0.981 0.991 0.985 0.993

Hungary 0.957 0.978 0.969 0.984 0.975 0.987

Iceland 0.972 0.986 0.982 0.991 0.987 0.993

Ireland 0.927 0.962 0.957 0.978 0.969 0.984

Italy 0.950 0.974 0.966 0.983 0.973 0.986

Japan 0.976 0.988 0.985 0.992 0.988 0.994

Korea 0.983 0.992 0.989 0.994 0.991 0.995

Netherlands 0.950 0.975 0.970 0.985 0.977 0.988

New Zealand 0.948 0.973 0.968 0.984 0.976 0.988

Norway 0.955 0.977 0.968 0.984 0.974 0.987

Portugal 0.983 0.991 0.990 0.995 0.993 0.996


Spain 0.914 0.955 0.939 0.968 0.949 0.974

Sweden 0.967 0.983 0.979 0.989 0.984 0.992

ρ3

Table 61: -Estimates for Retrieving Information


Australia 0.965 0.982 0.980 0.990 0.986 0.993

Austria 0.963 0.981 0.978 0.989 0.984 0.992

Belgium (Fl.) 0.896 0.945 0.927 0.962 0.942 0.970

Denmark 0.951 0.975 0.970 0.985 0.978 0.989

England 0.977 0.989 0.987 0.993 0.991 0.995

Finland 0.981 0.990 0.988 0.994 0.991 0.996

France 0.868 0.929 0.908 0.952 0.925 0.961

Germany 0.947 0.973 0.968 0.984 0.977 0.988

Greece 0.967 0.983 0.980 0.990 0.986 0.993

Hungary 0.919 0.958 0.925 0.961 0.928 0.963

Iceland 0.965 0.982 0.978 0.989 0.984 0.992

Ireland 0.953 0.976 0.971 0.985 0.979 0.989

Italy 0.943 0.971 0.962 0.981 0.971 0.985

Japan 0.983 0.992 0.988 0.994 0.990 0.995

Korea 0.941 0.970 0.960 0.980 0.969 0.984

Mexico 0.920 0.958 0.948 0.973 0.960 0.980

Netherlands 0.938 0.968 0.960 0.980 0.970 0.985

New Zealand 0.967 0.983 0.980 0.990 0.985 0.992


Scotland 0.959 0.979 0.974 0.987 0.980 0.990

Spain 0.946 0.972 0.962 0.981 0.969 0.984

Sweden 0.968 0.984 0.980 0.990 0.986 0.993

Switzerland (Fr.) 0.941 0.970 0.958 0.979 0.965 0.982

Switzerland (Ger.) 0.946 0.972 0.964 0.982 0.972 0.986

ρ3

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

171

Table 62: -Estimates for Interpreting Texts


Australia 0.924 0.961 0.951 0.975 0.962 0.980

Austria 0.918 0.957 0.948 0.973 0.961 0.980

Belgium (Fl.) 0.906 0.951 0.940 0.969 0.955 0.977

Switzerland (Fr.) 0.901 0.948 0.933 0.965 0.946 0.972

Switzerland (Ger.) 0.912 0.954 0.940 0.969 0.953 0.976

Denmark 0.912 0.954 0.944 0.971 0.957 0.978

England 0.942 0.970 0.965 0.982 0.975 0.987

Finland 0.955 0.977 0.971 0.985 0.978 0.989

France 0.827 0.905 0.865 0.927 0.881 0.937

Germany 0.922 0.960 0.952 0.975 0.9640 0.982

Greece 0.942 0.970 0.959 0.979 0.967 0.983

Hungary 0.934 0.966 0.960 0.980 0.971 0.985

Ireland 0.913 0.955 0.943 0.970 0.956 0.977

Iceland 0.929 0.963 0.953 0.976 0.963 0.981

Italy 0.890 0.942 0.923 0.960 0.939 0.968

Japan 0.960 0.979 0.974 0.987 0.981 0.990

Korea 0.927 0.962 0.950 0.975 0.961 0.980

Mexico 0.853 0.921 0.884 0.939 0.898 0.947

Netherlands 0.899 0.947 0.930 0.964 0.943 0.971

New Zealand 0.940 0.969 0.960 0.980 0.968 0.984

Portugal 0.939 0.969 0.964 0.982 0.974 0.987


Scotland 0.937 0.968 0.960 0.979 0.969 0.984

Spain 0.957 0.978 0.975 0.987 0.982 0.990

Sweden 0.938 0.968 0.954 0.976 0.961 0.980

ρ3

PIS

A 2

00

0 t

ech

nic

al r

epor

t

172

Table 63: -Estimates for Reflection and Evaluation


Australia 0.850 0.919 0.893 0.944 0.911 0.954

Austria 0.806 0.893 0.850 0.919 0.869 0.930

Belgium (Fl.) 0.838 0.912 0.886 0.939 0.906 0.951

Denmark 0.786 0.880 0.832 0.908 0.852 0.920

England 0.897 0.946 0.935 0.966 0.950 0.974

Finland 0.918 0.957 0.948 0.973 0.961 0.980

France 0.774 0.873 0.817 0.899 0.835 0.910

Germany 0.835 0.910 0.873 0.932 0.889 0.941

Greece 0.934 0.966 0.954 0.977 0.962 0.981

Hungary 0.863 0.926 0.897 0.946 0.912 0.954

Iceland 0.846 0.917 0.888 0.941 0.906 0.951

Ireland 0.805 0.892 0.858 0.923 0.880 0.936

Italy 0.817 0.899 0.856 0.923 0.873 0.932

Japan 0.937 0.968 0.961 0.980 0.971 0.985

Korea 0.823 0.903 0.855 0.922 0.870 0.930

Mexico 0.721 0.838 0.760 0.864 0.777 0.875

Netherlands 0.736 0.848 0.795 0.886 0.821 0.902

New Zealand 0.887 0.940 0.925 0.961 0.940 0.969

Norway 0.913 0.954 0.962 0.981 0.983 0.991

Portugal 0.867 0.929 0.914 0.955 0.934 0.966


Scotland 0.871 0.931 0.910 0.953 0.925 0.961

Spain 0.918 0.957 0.947 0.973 0.960 0.979

Sweden 0.867 0.929 0.909 0.952 0.927 0.962

Switzerland (Fr.) 0.826 0.905 0.871 0.931 0.889 0.942

Switzerland (Ital.) 0.836 0.910 0.882 0.937 0.901 0.948

ρ3

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

173

PIS

A 2

00

0 t

ech

nic

al r

epor

t

174

Inter-country RaterReliability Study - Outcome

Aletta Grisay

Overview of the Main Results

Approximately 1 600 booklets were submittedfor the inter-country rater reliability study, inwhich 41 796 student answers were re-scored bythe verifiers. In about 78 per cent of these cases,both the verifier and all four national markersagreed on an identical score, and in another 8.5 per cent of the cases, the verifier agreed withthe majority of the national markers.

Of the remaining cases, 5 per cent hadnational marks considered too inconsistent toallow comparison with the verifier’s mark, and 8 per cent were flagged and submitted toConsortium staff for adjudication. Inapproximately 5 per cent of these cases, theadjudicators found that there was no realproblem (marks given by the national markerswere correct), while 2.5 per cent of the markswere found to be too lenient and 1 per cent too harsh.

Hence, a very high proportion of the cases(91.5 per cent) showed substantial consistencywith the scoring instructions described in theMarking Guides. Altogether only 3.5 per cent ofcases were identified as either too lenient or tooharsh and 5 per cent were identified asinconsistent marks. These percentages aresummarised in Table 64.

Table 64: Summary of Inter-Country RaterReliability Study

Category Percentage

Agreement 91.5

Inconsistency 5.0

Harshness 1.0

Leniency 2.5

Note: Total cases = 41 796

While relatively infrequent, the inconsistent orbiased marks were not uniformly distributedacross items or countries, suggesting that scoringinstructions may have been insufficientlystringent for some items, and that some nationalscoring teams may have been less careful thanothers.

Main Results by Item

The qualitative analysis of the most frequentcases of marking errors may indicate someproblems in the scoring instructions that shedlight on the higher rate of inconsistenciesobserved for these items. For example:• Scoring instructions that were too complex.

For many items where a number of criteriahad to be taken into account simultaneouslywhen deciding on the code, markers oftentended to forget one or more of them. Forexample, in R120Q06: Student Opinions, theresponse (i) must contain some clear referenceto the selected author (Ana, Beatrice, Dieter,or Felix); (ii) must contain details showingthat the student understood the author’sspecific argument(s); (iii) must indicate howthese arguments agreed with the student’s ownopinions about space exploration; and (iv)must be expressed in the student’s own words.National markers often gave answers fullcredit if they commented on Ana’s concernsabout famine and disease or on Dieter’sconcerns about environment, but made noexplicit or implicit reference to spaceexploration. They also tended to give fullcredit to answers that did not sufficientlydiscriminate among two or more of the fiveauthors, or were not expressed in the student’sown words (i.e., they were mere quotations).

• Lack of precision in the scoring instructions.For example, in R119Q05: Gift, neither thescoring instructions nor the examplessufficiently explained the criteria to be appliedto decide on whether the student had actuallycommented on the last sentence, lastparagraph, or on the end of the story as awhole. Many answers that did not refer toany specific aspect described in the lastsentence were given full or partial credit bythe national markers, provided that somereference to the ‘ending’ was included.

• Many errors occurred in the marks given toresponses where the code depended on theresponse given to a previous item (e.g.,R099Q4B: Plan International).

• Cultural differences could explain why anumber of national markers tended to bereluctant to accept the understanding of thetext that was sometimes conveyed by theMarking Guide. In R119Q07: Gift, forexample, the idea that the cries of the panther

were mentioned by the author ‘to create fear’was often accepted, while this interpretationwas considered in the Marking Guide to be aclear case of code 0. In R119Q08: Gift, thenational markers often gave credit to answerssuggesting that the woman fed the panther toget rid of it or to avoid being eaten herself,

while the Marking Guide specified that pity orempathy were the only motivations to beretained for full credit.Table 65 presents detailed results by item.

Nine of the 26 items which are italicised in Table65 had total agreement of less than 90 per cent.

Table 66 gives the stem for each of these items.

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

175

Table 65: Summary of Item Characteristics for Inter-Country Rater Reliability Study

Percentage

Item Frequency Agreement Inconsistency Harshness Leniency

R070Q04 1618 89.7 6.5 1.5 2.2

R081Q02 1618 96.4 2.0 0.8 0.8

R081Q05 1618 88.5 6.9 1.5 3.2

R081Q06A 1617 87.8 8.1 1.7 2.5

R081Q06B 1618 87.0 8.6 2.0 2.4

R091Q05 1618 99.4 0.3 0.1 0.2

R091Q07B 1570 90.2 5.7 2.1 2.0

R099Q04B 1618 75.7 13.2 3.3 7.7

R110Q04 1618 98.1 1.2 0.3 0.4

R110Q05 1618 95.1 3.2 0.1 1.7

R119Q05 1617 78.3 13.6 3.2 4.9

R119Q07 1618 79.1 12.9 2.8 5.2

R119Q08 1618 85.2 8.5 1.9 4.4

R119Q09T 1618 90.5 5.1 2.1 2.3

R120Q06 1594 86.4 8.8 1.4 3.5

R219Q01E 1618 94.8 3.3 1.2 0.7

R219Q02 1617 95.7 2.7 0.6 1.1

R220Q01 1618 92.3 4.5 1.6 1.6

R236Q01 1569 92.8 2.1 0.7 4.4

R236Q02 1569 93.0 4.0 0.8 2.2

R237Q01 1618 97.2 1.2 0.4 1.2

R237Q03 1618 92.8 4.3 0.5 2.3

R239Q01 1569 99.0 0.6 0.1 0.3

R239Q02 1569 97.9 1.2 0.2 0.7

R246Q01 1617 98.2 0.9 0.4 0.6

R246Q02 1618 92.8 4.1 0.2 3.0

Note: Items in italics have a total agreement of less than 90 per cent.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

176

An analysis was conducted to investigatewhether the items with major marking problemsin a specific country showed particularly poorstatistics for that country in the item analysis.Table 67 presents the main inter-countryreliability results for the group of items wheremore than 25 per cent marking inconsistencieswere found in at least one country, together withfour main study indicators:• the item fit for that country;• whether the item difficulty estimate was

higher or lower than expected (0: no; 1: yes);and

• whether the item had a problem with scoreorder (cases where ability appeared to belower for higher scores (0: no; 1: yes)).Many items for which the inter-country

reliability study identified problems in thenational marking proved to have large fit indices(r = 0.40 between fit and per cent of markinginconsistencies, for N = 26 items × 27 nationalsamples = 702 cases). A moderately highcorrelation was also observed betweenpercentage of inconsistencies and problems inthe category order (r = 0.25).

One might also expect a significantcorrelation between the percentage of too-harshor too-lenient codes awarded by nationalmarkers and possible differential itemfunctioning. Actually the correlations, thoughsignificant, were quite low (r = 0.09 betweenFL_HARSH and HIGH_DELTA; and r = 0.17between FL_LENIE and LOW_DELTA). Thismay be due to the fact that in many countries,when an item showed a high percentage of too-lenient codes, it often also showed non-negligibleoccurrences of too-harsh codes.

Main Results by Country

Table 68 presents the detail of the results bycountry. In most countries (23 of 31), theinternational verifier or the adjudicator agreedwith the national marks in more than 90 per centof cases. The remaining eight countries—thosewith less than 90 per cent of the cases in theagreement category—are italicised in the table.

Four of these eight countries had quite a fewcases in the ‘too inconsistent’ category, butshowed no specific trend towards harshness orleniency.

Table 66: Items with Less than 90 Per Cent Consistency

Item Stem

R070Q04: Beach Does the title of the article give a good summary of the debatepresented in the article? Explain your answer.

R081Q05: Graffiti Why does Sophia refer to advertising?

R081Q06A: Graffiti Which of the two letter writers do you agree with? Explain youranswer by using your own words to refer to what is said in one orboth of the letters.

R081Q06B: Graffiti Which do you think is the better letter? Explain your answer byreferring to the way one or both letters are written.

R099Q04B: Plan International What do you think might explain the level of Plan International’sactivities in Ethiopia compared with its activities in other countries?

R119Q05: Gift Do you think the last sentence of “The Gift” is an appropriate ending?

R119Q07: Gift Why do you think the writer chooses to introduce the panther withthese descriptions?

R119Q08: Gift What does the story suggest was the woman’s reason for feeding thepanther?

R120Q06: Student Opinions Which student do you agree with most strongly?

Ou

tcom

es o

f m

arke

r re

liab

ilit

y st

ud

ies

177

Table 67: Inter-Country Reliability Study Items with More than 25 Per Cent Inconsistency Rate in at LeastOne Country

Country Item Inconsistency Harshness Leniency Fit High Low Abilities(%) (%) (%) Mean Fit Fit Not

Square Ordered

Austria R081Q06A 29.2 4.2 0.0 0.99 0 0 0

Canada (Fr.) R081Q06B 25.5 0.0 2.1 0.94 0 0 0

Canada (Fr.) R119Q07 29.8 0.0 2.1 1.36 0 0 1

Czech Republic R099Q04B 33.4 4.2 2.1 1.15 0 1 1

Czech Republic R119Q05 41.7 4.2 8.3 1.15 0 0 0

Czech Republic R119Q07 52.1 2.1 4.2 1.28 0 0 1

Denmark R070Q04 37.5 0.0 8.3 1.02 0 0 0

Denmark R099Q04B 62.6 0.0 18.8 1.19 0 0 0

Denmark R119Q05 27.1 0.0 10.4 1.24 0 0 0

Greece R099Q04B 35.5 2.1 14.6 1.32 0 1 0

Iceland R236Q01 62.5 0.0 62.5 0.97 0 1 0

Ireland R119Q05 39.6 0.0 2.1 1.22 0 0 1

Japan R099Q04B 35.4 4.2 22.9 1.19 0 0 0

Japan R119Q07 33.3 0.0 12.5 1.20 1 0 0

Korea R099Q04B 45.9 6.3 31.3 1.20 0 1 0

Korea R119Q05 37.5 8.3 16.7 1.04 1 0 0

Korea R119Q07 29.3 4.2 18.8 1.16 1 0 0

Mexico R091Q07B 39.6 22.9 12.5 1.17 1 0 0

Mexico R099Q04B 31.2 2.1 20.8 1.21 0 1 0

Netherlands R119Q08 39.6 0.0 12.5 1.28 0 0 1

Portugal R119Q05 25.1 2.1 4.2 1.23 0 0 0

Russian FederationR099Q04B 43.8 4.2 33.3 1.18 0 1 0

Russian FederationR119Q05 37.5 2.1 27.1 1.25 0 1 0

Russian FederationR119Q07 31.3 4.2 20.8 1.16 0 1 0

Sweden R081Q06B 27.2 6.3 4.2 0.96 1 0 0

United States R119Q07 29.1 8.3 8.3 1.33 0 1 1

However, the rate of ‘too lenient’ cases was 5 per cent or more in four countries: France (5 per cent), Korea (5 per cent), the RussianFederation (7 per cent) and Latvia (9 per cent).The details provided by the country reports seemto suggest that, in all four countries, scoringinstructions were not strictly followed for anumber of items. In the Russian Federation, forfive of the 26 items in Booklet 7, more than 10per cent of cases were marked too leniently. InKorea, more than 10 per cent of cases for oneitem were marked too harshly, while six otheritems were too leniently marked. In Latvia, four

items received very systematic lenient marks(between one-third and one-half of studentanswers were marked too leniently for theseitems), while a few others had both highproportions of cases which were either tooharshly or too leniently marked.

In both Russia and Latvia, there was someindication that, for a number of partial credit items,the markers may have used codes 0, 1, 2 and 3in a non-standard way. They gave code 3 to whatthey thought were the ‘best’ answers, 2 to answersthat seemed ‘good, but not perfect’, 1 to ‘poor,but not totally wrong’ answers, and 0 to ‘totally

wrong’ answers, without considering the specificscoring rules for each code provided in theMarking Guide. In Latvia, the NPM had decidednot to translate the Marking Guide into Latvian,on the assumption that all the markers weresufficiently proficient in English to use theEnglish source version.

Ten other countries had leniency problems forjust one or two items sometimes due to residualtranslation errors. In some of these cases, theproportion of incorrect marks may have beensufficient to affect the item functioning.

In general, harshness problems were veryinfrequent.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

178

Table 68: Inter-Country Summary by Country

Percentage

Country Frequency Consistent Inconsistent Harsh Lenient

Australia 1 248 92.4 5.6 1.4 0.6Austria 1 247 91.6 6.4 1.1 0.9Belgium (Fl.) 1 078 85.6 9.6 1.8 3.0Belgium (Fr.) 1 248 92.2 4.9 1.6 1.3Brazil 1 248 91.5 4.6 0.6 3.4Canada (Eng.) 1 248 91.4 6.7 0.7 1.2Canada (Fr.) 1 222 91.0 6.9 0.6 1.6Czech Republic 1 248 86.9 9.5 0.6 3.0Denmark 1 248 86.8 9.0 0.5 3.8England 1 248 93.3 5.0 0.4 1.3Finland 1 248 96.1 2.4 0.6 1.0France 1 196 80.2 12.5 2.3 5.0Germany 1 224 92.6 5.8 0.7 0.8Greece 1 247 90.5 3.3 1.7 4.6Hungary 1 274 92.9 2.3 2.4 2.4Iceland 1 248 91.7 5.4 0.3 2.6Ireland 1 248 92.5 5.0 0.6 1.8Italy 1 170 89.8 7.0 1.5 1.7Japan 1 248 95.0 1.8 0.7 2.5Korea 1 248 89.3 3.9 1.6 5.1Luxembourg 1 144 95.9 2.1 1.6 0.4Latvia 1 066 80.6 7.5 2.5 9.4Mexico 1 247 91.1 3.5 2.8 2.6Netherlands 1 248 91.3 5.8 1.4 1.5New Zealand 1 222 96.5 2.7 0.1 0.7Norway 1 248 92.9 4.4 0.6 2.1Poland 1 248 92.1 3.0 1.0 3.8Portugal 1 248 92.0 5.4 1.8 0.8Russian Federation 1 248 88.5 3.7 0.9 6.9Scotland 1 248 94.5 3.8 0.6 1.1Spain 1 248 94.1 3.4 0.8 1.8Sweden 1 200 93.2 4.0 1.6 1.3Switzerland 1 300 93.4 4.0 2.4 0.2United States 1 247 92.1 5.2 1.4 1.3

Note: Countries with less than 90 per cent of cases in the agreement category are in italics.

Data Adjudication

Ray Adams, Keith Rust and Christian Monseur

In January 2001, the PISA Consortium, theSampling Referee and the OECD Secretariat metto review the implementation of the survey ineach participating country and to consider:• the extent to which the country had met PISA

sampling standards;• the outcomes of the national centre and school

quality monitoring visits;• the quality and completeness of the submitted

data;• the outcomes of the inter-country reliability

study; and• the outcomes of the translation verification

process.The meeting led to a set of recommendations

to the OECD Secretariat about the suitability ofthe data from each country for reporting and usein analyses of various kinds, and a set of follow-up actions with a small number of countries.

This chapter reviews the extent to which eachcountry met the PISA 2000 standards. For eachcountry, a recommendation was made on the useof the data in international analyses and reports.Any follow-up data analyses that needed to beundertaken to allow the Consortium to endorsethe data were also noted.

Overview of Response RateIssues

Table 69 presents the results of some simplemodelling that was undertaken to consider theimpact of non-response on estimates of means. Ifone assumes a correlation between a school’spropensity to participate in PISA and studentachievement in that school, the bias in mean

scores resulting from various levels of non-response can be computed as a function of thestrength of the relationship between thepropensity to non-response and proficiency. 1

The rows of Table 69 correspond to differingresponse rates, and the columns correspond todiffering correlations between propensity andproficiency. The values give the mean scores thatwould be observed for samples with matchingcharacteristics.

The true mean and standard deviation wereassumed to be 500 and 100 respectively. InPISA, the typical standard error for a samplemean is 3.5. This means that any values in thetable of approximately 506 or greater have abias greater than a difference from the true valuethat would be considered statistically significant.Those values are shaded in Table 69.

While the true magnitude of the correlationbetween propensity and achievement is notknown, standards can be set to guard againstpotential bias by non-response. The PISA schoolresponse rate of 0.85 was chosen to protectagainst any substantial relationship betweenpropensity and achievement.

1 The values reported in Table 69 were computed usinga model that assumes a bivariate normal distributionfor propensity to respond and proficiency. Theexpected means were computed under the assumptionthat proficiency is only observed for schools whosepropensity to non-response is below a value that yieldsthe specified response rate.

Chapter 15

PIS

A 2

00

0 t

ech

nic

al r

epor

t

180

Figure 30 is a plot of the attained PISA schoolresponse rates. It shows a scatter-plot of theschool response rates (weighted) that wereattained by the PISA participants. Those countriesthat are plotted in the shaded region wereregarded as fully satisfying the PISA schoolresponse rate criterion.

As discussed below, each of the six countries(Belgium, the Netherlands, New Zealand, Poland,the United Kingdom and the United States) thatdid not meet the PISA criterion were asked toprovide supplementary evidence that would assistthe Consortium in making a balanced judgementabout the threat of non-response to the accuracyof inferences which could be made from the PISAdata.

The strength of the evidence that theConsortium required to consider a country’ssample as sufficiently representative is directlyrelated to distance of the achieved response ratefrom the shaded region in the figure.

Table 69: The Potential Impact of Non-Response on PISA Proficiency Estimates

Correlation Between Retentivity and ProficiencyResponse

Rate 0.08 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00

0.05 517 521 541 562 583 603 624 644 665 686 706

0.10 514 518 535 553 570 588 605 623 640 658 675

0.15 512 516 531 547 562 578 593 609 624 640 655

0.20 511 514 528 542 556 570 584 598 612 626 640

0.25 510 513 525 538 551 564 576 589 602 614 627

0.30 509 512 523 535 546 558 570 581 593 604 616

0.35 508 511 521 532 542 553 563 574 585 595 606

0.40 508 510 519 529 539 548 558 568 577 587 597

0.45 507 509 518 526 535 544 553 562 570 579 588

0.50 506 508 516 524 532 540 548 556 564 572 580

0.55 506 507 514 522 529 536 543 550 558 565 572

0.60 505 506 513 519 526 532 539 545 552 558 564

0.65 505 506 511 517 523 528 534 540 546 551 557

0.70 504 505 510 515 520 525 530 535 540 545 550

0.75 503 504 508 513 517 521 525 530 534 538 542

0.80 503 503 507 510 514 517 521 524 528 531 535

0.85 502 503 505 508 511 514 516 519 522 525 527

0.90 502 502 504 506 508 510 512 514 516 518 519

0.95 501 501 502 503 504 505 507 508 509 510 511

0.99 500 500 501 501 501 501 502 502 502 502 503

Follow-up Data AnalysisApproach

After reviewing the sampling outcomes, theConsortium asked countries falling below thePISA school response rate standards to provideadditional data or evidence to determine whetherthere was a potential bias. Before describing thegeneral statistical approach that the Consortiumadopted, it is worth noting that the existence ofa potential bias can only be assessed for thevariables used in the analyses and does notexclude that the sample is biased with respect toanother variable.

In most cases, the Consortium worked withnational centres to undertake a range of logisticregression analyses. Logistic regression is widelyused for selecting models for variables with onlytwo outcomes. In the analysis recommended bythe Consortium, the dependent variable is theresponse or non-response of a particular sampledschool. Suppose, for simplicity’s sake, that onlytwo characteristics are known for theschool region and urban status. Let be theresponse outcome for the kth school in region i

( ) with urban status j (j = 0,1). Thendefine:

= 1 if school k participates in thesurvey,

= 0 if school k does not participatein the survey.

Under the quasi-randomisation model (see, forexample, Oh and Scheuren, 1983), response ismodelled as another phase of sampling,assuming that the are independent randomvariables, with probability for outcome 1and probability for outcome 0. is theschool’s probability of response, or its responsepropensity.

The logistic regression model goes a step beyondthis to model and is specified as follows:

,

which is algebraically equivalent to:

.pijki j

i j

=+ +

+ + +exp( )

exp( )

µ α βµ α β1

lnp

pijk

ijki j1 −

= + +µ α β

pijk

pijk1 − pijk

pijk

Yijk

Yijk

i = …1 4, ,

Yijk

Dat

a ad

jud

icat

ion

181

Figure 30: Plot of Attained PISA School Response Rates

Scho

ol R

espo

nse

Rat

e A

fter

Rep

lace

men

t

School Response Rate Before Replacement

United States

United Kingdom

Netherlands

Poland

BelgiumNew Zealand

100

90

80

70

60

50

40

30

20

10

0

0 10 20 30 40 50 60 70 80 90 100

The quantity is the odds that the

school participates. A logarithm gives a multi-plicative rather than an additive model that isalmost universally preferred by practitioners(see, for example, McCullagh and Nelder, 1983).The quantities on the right-hand side of theequation define parameters that are thenestimated using maximum likelihood methods.

There is concern that non-response bias maybe a significant problem if any of the or terms is substantial since this indicates thatregion and/or urban status are associated withnon-response.

Countries occasionally also provided otherstatistical analyses in addition to or in place ofthe logistic regression analysis.

Review of Compliancewith Other Standards

It is important to recognise that the PISA dataadjudication is a late but not necessarily finalstep in a quality assurance process. By the timeeach country was adjudicated, quality assurancemechanisms (such as the sampling proceduresdocumentation, translation verification, datacleaning and site visits) had identified a range ofissues and ensured that they had beenrectified at least in the majority of cases.

Details on the various quality assuranceprocedures and their outcomes are documentedelsewhere (see Chapter 7 and Appendix 5). Dataadjudication focused on residual issues thatremained after these quality assurance processes.There were not many such issues and theirprojected impact on the validity of the PISAresults was deemed to be negligible.

Unlike sampling issues, which under mostcircumstances could directly affect all of acountry’s data, the residual issues identified inother areas have an impact on only a smallproportion of the data. For example, markingleniency or severity for a single item in readinghas an effect on between just one-third and one-half of 1 per cent of the reading data and evenfor that small fraction, the effect would beminor.

Detailed CountryComments

Summary

Concerns with sampling outcomes and complianceproblems with PISA standards resulted in recom-mendations to place constraints on the use of thedata for just two countries the Netherlands andJapan. The Netherlands’ response rate was verylow. It was therefore recommended to exclude thedata from the Netherlands from tables and chartsin the international reports that focus on the levelor distribution of student performance in one ormore PISA domain, or on the distribution of acharacteristic of schools or students. Japan sampledintact classes of students. It was recommended,therefore, that data from Japan not be includedin analyses that require disentangling variancecomponents.

Two countries—Brazil and Mexico—had note-worthy coverage issues. In Brazil, it was deemed un-reasonable to assess 15-year-olds in very low grades,which resulted in about 69 per cent coverage of the15-year-old population. In Mexico, just 52 per centof 15-year-olds are enrolled in schools muchlower than all other participating countries.

In two countries—Luxembourg and Poland—higher than desirable exclusions were noted.Luxembourg had a school-level exclusion rate of9.1 per cent. The majority of these excludedstudents (6.5 per cent) were instructed in languagesother than the PISA test languages. In Poland,primary schools accounting for approximately 7 per cent of the population were excluded.

For all other participants, the problemsidentified and discussed below were judged to besufficiently small that their impact oncomparability was negligible.

Australia

A minor problem occurred in Australia in thatsix (of 228) schools would not permit a sampleto be selected from among all PISA-eligiblestudents, but restricted it in some way, generallyby limiting it to those students within the modalgrade. Weighting adjustments were made topartially compensate for this. This problem wastherefore not regarded as significant andAustralia fully met the PISA standards, so thatinclusion in the full range of PISA reports wasrecommended.

β jα i

p

pijk

ijk1 −

PIS

A 2

00

0 t

ech

nic

al r

epor

t

182

Austria

In Austria, students in vocational schools areenrolled on a part-time/part-year basis. Sincethe PISA assessment was conducted only at onepoint in time, these students are captured onlypartially. Thus, it is not possible to assess howwell the students sampled from vocationalschools represent the universe of studentsenrolled in vocational schools, and so thosestudents not attending classes a the time of thePISA assessment are not represented in the PISAresults.Additionally, Austria did not follow the correctprocedure for sampling small schools. This ledto a need to trim the weights. Consequently,small schools may be slightly under-represented.It was concluded that any effect of this under-representation would be very slight, and the datawere therefore suitable for inclusion in the fullrange of PISA reports.

Belgium

The school-weighted participation rate for theFlemish Community of Belgium was quite low:61.5 per cent before replacement and 79.8 per centafter replacement. However the French Communityof Belgium was very close to the internationalstandard before replacement (79.9 per cent) andeasily reached the standard after replacement(93.7 per cent). For Belgium as a whole, theweighted school participation rate before replace-ment was 69.1 per cent and 85.5 per cent afterreplacement. Belgium did, therefore, not quitemeet the school response rate requirements.

The German-speaking community, whichaccounts for about 1 per cent of the PISApopulation, did not participate.2 As the schoolresponse rate was below the PISA 2000standard, additional evidence was sought fromthe Flemish Community of Belgium with regardto the representativeness of its sample. TheFlemish community provided additional datawhich included a full list of schools (samplingframe) containing the following information oneach school:

Dat

a ad

jud

icat

ion

183

2 In fact, a sample of students from the German-speaking community was tested, but the data were notsubmitted to the international Consortium.

• whether it had been sampled;• whether it was a replacement;• its stratum membership;• participation status; and• percentage of over-aged students.

In Belgium, compulsory education starts forall children in the calendar year of the child’ssixth birthday. Thus, the student’s expected gradecan be computed from knowing the birth date.

The Belgian educational systems use graderetention to increase the homogeneity of studentabilities within each class. In most cases, a studentwho does not reach the expected performance levelwill not graduate to the next grade. Therefore, acomparison of the expected and actual gradeprovides an indication of student performance.

Table 70 gives the performance on thecombined reading literacy scale for the threemost frequent grades in the Flemish studentsample and the percentage of 15-year-oldsattending them.

Table 70: Performance in Reading by Grade in theFlemish Community of Belgium

Grade Percentage Readingof Students Performance

8 2.5 363.59 22.8 454.7

10 72.2 564.3

The unweighted correlation between schoolmean performance for the combined readingliteracy scale and the percentage of over-agedstudents is equal to –0.89. The school percentageof over-aged students constitutes a powerfulsurrogate of the school mean performance, andwas therefore used to see if student achievementin responding schools was likely to besystematically different from that in non-responding schools.

Table 71 shows the mean of the school over-aged percentages according to schoolparticipation status and sample status (originalor replacement). The data used to compute theseresults are based only on regular schools. Datafrom special education schools and part-timeeducation schools were not included because alloriginally sampled students from these strataparticipated in PISA.

The data included in Table 71 show thatschools with lower percentages of over-agedstudents were more likely to participate thanschools with higher percentages of over-agedstudents. It also appears that the addition ofreplacement schools seems to have reduced thebias introduced by the original schools thatrefused to participate.

Further, additional analyses showed that thebias was correlated with stratification variables,that is the type of school (academic, technical orvocational) and the school board. The school-level non-response adjustment applied duringweighting would thus have reduced the bias.

Given the marginal amount by which Belgiumdid not meet the sampling criteria, and thesuccess of correcting for potential bias byreplacements and non-response adjustments, itwas concluded that there was no cause forconcern about the potential of significant non-response bias in the PISA sample for Belgium.

It was also noted that the Flemish Communityof Belgium did not adhere strictly to the requiredsix-week testing window. All testing sessionswere conducted within a three-month testingperiod and therefore none violated the definitionof the target population. Variance analyses wereperformed according to the testing period andno significant differences were found.

It was concluded that, while Belgium deviatedslightly from PISA standards, it could berecommended that the data be used in the fullrange of PISA analyses and reports.

Brazil

In Brazil, 15-year-olds are enrolled in a widerange of grades. It was not deemed appropriate

PIS

A 2

00

0 t

ech

nic

al r

epor

t

184

to administer the PISA test to students in grades1 to 4. This decision was made in consultationwith Brazil and the Consortium and resulted inabout 69 per cent coverage of the 15-year-oldpopulation.

For many secondary schools in Brazil, anadditional sampling stage was created betweenthe school and student selection. Manysecondary schools operate in threeshifts morning, afternoon and evening. Anapproved sampling procedure was developedthat permitted the selection of one shift (or twoif there was a small shift) at random, with 20students being selected from the selected shifts.

Weighting factors were applied to the studentdata to reflect this stage of sampling. Thus, if oneshift was selected in a school, the student datawere assigned a factor of three in addition to theschool and student base weight components. Afterweighting, the weighted total student populationwas considerably in excess of the PISA populationsize as measured by official enrolment statistics. Itwas concluded that, in many cases, either or bothof the following occurred: the shift sampling wasnot carried out according to the procedures; orthe data on student numbers within each schoolreflected the whole school and not just theselected shifts. Since which errors occurred inwhich schools could not be determined, theseproblems could not be addressed in weighting.However, there was no evidence of a systematicbias in the sampling procedures.

Canada

Canada fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Czech Republic

The Czech Republic fully met the PISAstandards, and inclusion in the full range of PISAreports was recommended.

Denmark

Denmark fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Table 71: Average of the School Percentage ofOver-Aged Students

School Sample and Number MeanParticipation Status of Schools Percentage

Original sample 145 0.288

All participating schools 119 0.262

Participating schools from the original sample 88 0.255

Non-participating schools from the original sample 57 0.340

Finland

Finland fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

France

France did not supply data on the actual numberof students listed for sampling within each school,as required under PISA procedures for calculatingstudent weights. Rather, France provided thecurrent year enrolment figures for enrolled 15-year-olds as recorded at the Ministry. The datashowed that the two quantities were sometimesslightly different, but no major discrepancieswere detected. Thus, this problem was notregarded as significant.

Results from the inter-country reliability studyindicated an unexpectedly high degree of variationin national ratings of open-ended items. The inter-country reliability study showed that agreementbetween national markers and internationalverifiers was 80.2 per cent—the lowest level ofagreement among all PISA participants. Of theremaining 19.8 per cent, 12.5 per cent of theratings were found to be inconsistent, 2.3 per centtoo harsh and 5.0 per cent too lenient.

Given that this rater inconsistency was likelyto affect only a small proportion of items, andbecause the disagreement included both harshnessand leniency deviations, it was concluded thatno systematic bias would exist in scale scoresand that the data could be recommended forinclusion in the full range of PISA reports.

Germany

Germany fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Greece

The sampling frame for Greece had deficientdata on the number of 15-year-olds in eachschool. This resulted in selecting many schools inthe sample with no 15-year-olds enrolled, or thatwere closed (34 of a sample of 175 schools).Often the replacement school was then included,which was not the correct PISA procedure forsuch cases. Also quite a few sampled schools(a further 34) had relatively few 15-year-olds.

Rather than being assessed or excluded, theseschools were also replaced, often by larger schools.This resulted in some degree of over-coverage inthe PISA sample, as there were 18 replacementschools in the sample for schools that should nothave been replaced, and 32 replacements forsmall schools where the replacement was generallylarger than the school it replaced. It is not possibleto evaluate what kinds of students might beover-represented, but since the measures of sizewere unreliable, this is unlikely to result in anysystematic bias. Thus, this problem was notregarded as significant, and inclusion in the fullrange of PISA reports was recommended.

Hungary

Hungary did not strictly follow the correctprocedure for the sampling of small schools.However, this had very little negative impact onsample quality and thus inclusion in the fullrange of PISA reports was recommended.

Iceland

Iceland fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Ireland

Ireland fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Italy

Italy fully met the PISA standards, and inclusionin the full range of PISA reports wasrecommended.

Japan

The translation of the assessment material inJapan was somewhat deficient. A rather highnumber of flawed items was observed in boththe field trial and main study. However, carefulexamination of item-by-country interactions anddiscussion with the Japanese centre resulted inthe omission of only one item (see Figure 25),where there was a problematic interaction. Noremaining potential problem led to detectable

Dat

a ad

jud

icat

ion

185

item-by-country interactions.Japan did not follow the PISA 2000 design,

which required sampling students rather thanclasses in each school. This means that the between-class and between-school variance components willbe confounded. The Japanese sample is suitablefor PISA analyses that do not involve disentanglingschool and class variance components, but itcannot be used in a range of multilevel analyses.

It was recommended, therefore, not to includedata from Japan in analyses that requiredisentangling variance components, but the datacan be included in all other analyses and reports.

Korea

Korea used non-standard translation procedures,which resulted in a few anomalies. When reviewingitems for inclusion or exclusion in the scaling ofPISA data (based upon psychometric criteria),the items were carefully reviewed for potentialaberrant behaviour because of translation errors. Fourreading items were deleted because of potentialproblems, and the remaining data were recommendedfor inclusion in the full range of PISA reports.

Latvia

Latvia did not follow the PISA 2000 translationguidelines, and the international coding studyshowed many inconsistencies between coding inLatvia and international coding. In the inter-countryreliability study, the agreement between nationalmarkers and international verifiers was 80.6 percent. Of the remaining 19.4 per cent, 2.5 per centof the ratings were found to be too harsh and9.4 per cent too lenient. These results may be dueto the fact that the marking guide was nottranslated into Latvian on the assumption thatall markers were sufficiently proficient in Englishto use its English version.

Since this rater inconsistency was likely to affectonly a small proportion of items, and becausethe disagreements included both harshness andleniency deviations, it was concluded that thedata could be recommended for inclusion in thefull range of PISA reports.

Liechtenstein

Liechtenstein fully met the PISA standards, and inclu-sion in the full range of PISA reports was recommended.

Luxembourg

The school level exclusion rate of 9.1 per cent inLuxembourg was somewhat higher than thePISA requirements. The majority of excludedstudents (6.5 per cent) were instructed inlanguages other than the PISA test languages(French and German) and attended the EuropeanSchool. As this deviation was judged to have nomarked effect on the results, it was concludedthat Luxembourg could be included in the fullrange of PISA reports.

Mexico

Mexico fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended. However, the proportion of 15-year-olds enrolled in schools in Mexico wasmuch lower than in all other participatingcountries (52 per cent).

Netherlands

The international design and methodology werewell implemented in the Netherlands. Unfortunately,school participation rates were not acceptable.Though the initial rate of 27.2 per cent was farbelow the requirement to even considerreplacement (65 per cent), the Netherlands wasallowed to try to reach an acceptable responserate through replacements. However, thepercentage after replacement (55.6 per cent) wasstill considerably below the standard required.

The Consortium undertook a range ofsupplementary analyses jointly with the NPM inthe Netherlands, which confirmed that the datamight be sufficiently reliable to be used in somerelational analyses. The very low response rate,however, implied that the results from the PISAdata for the Netherlands cannot, with any confid-ence, be expected to reflect the national populationto the level of accuracy and precision required.This concern extends particularly to distributionand sub-group information about achievement.

The upper secondary educational system in theNetherlands consists of five distinct tracks. VWOis the most academic track, and on a continuumthrough general to vocational education, tracksare HAVO, MAVO, and (I)VBO. Many schoolsoffer more than one track, but few offer all four.Thus, for example, of the 95 schools thatparticipated in PISA which could be matched to

PIS

A 2

00

0 t

ech

nic

al r

epor

t

186

the national database of examination results, 38offered (I)VBO, 76 offered MAVO, 64 offeredHAVO and 66 offered VWO. Each part of theanalysis therefore applied to only a sub-set of theschools.

Depending upon the track, students take theseexaminations at the end of grades 10, 11, or 12.The PISA population was mostly enrolled ingrades 9 and 10 in 2000 (and thus is neither thesame cohort, nor, generally, at the sameeducational level as the population for whichnational examination results are available). Theschool populations for these two groups ofstudents, however, are the same.

To examine the characteristics of the PISA2000 sample in the Netherlands (at the schoollevel), the NPM was able to access the results ofnational examinations in 1999. Data fromnational examinations were combined for thetwo vocational tracks (IVBO and VBO), butother than this combination, available data werenot comparable across tracks, which made itdifficult to develop overall summaries of thequality of the PISA school sample. The analysesof data showed two types of outcomes. The firstdealt with the percentage of students taking atleast three subjects at C or D level3 in the(I)VBO track. This analysis provided someevidence that, when compared to the originalsample, the participating schools had a lowerpercentage of such students than the non-participants and the rest of the population.However, there were only 18 participants of 69in the sample, and so the discrepancy was notstatistically significant. There is evidence that theinclusion of the replacement schools (20participated) effectively removed any suchimbalance. Since this track covers only about 40per cent of the schools in the population andsample, little emphasis was placed on this result.

A second analysis was made of the averagegrade point of the national examinations foreach school and for each track, overall and forseveral subject groupings. The average gradepoint average across Dutch language,mathematics and science was used. The resultsare available for each track, but are not compar-able across tracks (as evidenced, for example, bythe fact that students take the examinations at

Dat

a ad

jud

icat

ion

187

different grades in different tracks).For the (I)VBO track, the average grade point

across all subjects is 6.17 for PISA participantsand 6.29 for the rest of the population, adifference of about 0.375 standard deviations,that is statistically significant (p = 0.034). Thisphenomenon is not reflected in the results forDutch language, but is seen also in the resultsfor mathematics and science. Here the mean forparticipants is 5.83, and 6.03 for the rest of thepopulation. This difference is about 0.45standard deviations, and is statistically significant(p = 0.009). These results suggest that PISA datamay somewhat under-estimate the overallproficiency of students in the Netherlands, at leastfor mathematics and science.

In the VWO track, the population variance ofthe PISA participants is considerably smallerthan for the rest of the population, for allsubjects combined, for Dutch language, and formathematics and science. Thus the PISAparticipants appeared to be a considerably morehomogeneous group of schools than thepopulation as a whole, even though their meanachievement is only trivially different. Thus, thevariance of the rest of the population is 76 per centhigher than the PISA participants for all subjectscombined, 39 per cent higher for Dutch language,and 93 per cent higher for mathematics and science.This is also seen in the HAVO track for all subjectsand for Dutch language, and the MAVO trackfor mathematics and science. These results suggestthat the Dutch PISA data might well show asomewhat more homogeneous distribution ofstudent achievement than is true for the wholepopulation, thus leading to biased estimates ofpopulation quantiles, percentages above variouscut-off points, and sub-group comparisons.

Considering the extent of sample non-response, it was noted that the results of thisanalysis did show a high degree of similaritybetween participating schools and the rest of thepopulation on the one hand, and the non-participants on the other. However, since theresponse rate is so low, even minor differencesbetween these groups are of concern.

An additional factor to consider in judging thequality of the PISA sample for the Netherlands isthe actual sample size of students in the PISAdata file. PISA targeted a minimum of 4 500assessed students for each country. Not all countriesquite reached this target, primarily because of

3 Difficulty levels in the national examinations.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

188

low levels of school participation, but the samplesize for the Netherlands is only 56 per cent of thistarget (2 503 students). Bias considerations aside,the sample cannot therefore give results to thelevel of precision envisaged in specifying the PISAdesign. This is especially serious for mathematicsand science, since, by design, only 55.6 per centof the PISA sample were assessed in each ofthese subjects.

In summary, it was concluded that the resultsfrom the Netherlands PISA sample may not havebeen grossly distorted by school non-participation.In fact the correspondence in achievement onnational examinations between the participantsand the population is remarkably high given thehigh level of non-participation. However, the highlevel of school non-participation means that evenreasonably minor differences between theparticipants and the intended sample can meanthat the non-response bias might significantlyaffect the results. The findings of follow-upanalyses above, together with the lowparticipation rate and the small final sample size,meant that the Consortium could not haveconfidence that the results from the PISA datafor the Netherlands reflected the nationalpopulation to the level of accuracy and precisioncalled for in PISA, particularly for distributionand sub-group information about achievement.

It was recommended, therefore, to exclude datafrom the Netherlands from tables and charts inthe international reports that focus on the levelor distribution of student performance in one ormore of the PISA domains, or on the distributionof a characteristic of schools or students. Datafrom the Netherlands could be included in theinternational database and the Netherlandscould be included in tables or charts that focusprimarily on relationships between PISAbackground variables and student performance.

New Zealand

The before-replacement response rate in NewZealand was 77.7 per cent, and the after-replacement response rate was 86.4 per cent,slightly short of the PISA 2000 standards.

Non-response bias in New Zealand wasexamined by looking at national Englishassessment results for Year 11 students in allschools, where 87.6 per cent of the PISApopulation can be found.

A logistic regression approach was alsoadopted by New Zealand to analyse theexistence of a potential bias. The dependentvariable was the school participation status, andthe independent variables in the model were:• school decile, an aggregate at the school level

of the social and economic backgrounds ofthe students attending the school;

• rural versus urban schools;• the school sample frame stratification

variables, i.e., very small school, certainty4 orvery large schools and other schools;

• Year 11 student marks in English on thenational examination; and

• initially, the distinction between public andprivate schools, but this variable was removeddue to the size of the private schoolpopulation.Prior to performing the logistic regression,

Year 11 student marks were averaged accordingto school participation status, and (i) schooldecile, (ii) public versus private school, (iii) ruralversus urban school, and (iv) the school sizestratification variable from the school samplingframe.

The unweighted logistic regression analysesare summarised in Table 72. There is noevidence that low school response rate intro-duced bias into the school sample for New Zealand.

Table 72: Results for the New Zealand LogisticRegression

Parameter Estimate Chi-Squared p-value

English marks 0.0006 0.0229 0.98

Very small school -1.3388 1.4861 0.37

Certainty or very large schools 0.5460 0.5351 0.30

Urban -0.1452 0.4104 0.72

Based on the logistic regression and the sub-group mean achievement, it is possible thatstudents in the responding schools are slightlylower achievers than those in the non-respondingschools. But the difference in the average resultsfor responding and non-responding schools was0.04 standard deviations, which is not

4 Schools with a selection probability of 1.0.

significant. Further, after the effects of the keystratification variables, which were used to makeschool non-response adjustments, there is nonoticeable effect of school mean achievement onPISA participation. Furthermore, there is noevidence that the responding and non-respondingschools might differ substantially in theirhomogeneity (as distinct from the case in theNetherlands).

It was recommended, therefore, that New Zealand data could be included in the fullrange of PISA reports.

Norway

Norway fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Poland

Poland did not follow the international samplingmethodology. First, Poland did not follow theguidelines for sampling 15-year-olds in primaryschools. Consequently, the small sample ofstudents from these schools, which includedapproximately 7 per cent of the 15-year-oldpopulation, have been excluded from the data.These students were very likely to have been wellbelow the national average in achievement.

Second, Poland extended the testing period,and 14 schools tested outside the three-monthtesting window that was required to ensure theaccurate estimation of student age at testingtime. Analysis showed that schools testingoutside the window performed better thanschools testing inside the window, and theformer were dropped from the nationaldatabase.

Strong evidence was provided that schoolnon-response bias is minimal in the PISA datafor Poland, based on the following:• The majority of school non-response in Poland

was not actual school non-response, but theconsequence of schools testing too late.

• Schools testing outside the window performedbetter than those that tested inside the window,which vindicates the decision to omit themfrom the data, since their higher performancemay have been due to the later testing.

• The results with the late schools omitted areabout one scale score point lower than they

would be had these late schools been included.This indicates that had these schools testedduring the window and been included, theresults would have been less than one pointabove the published result. It seems extremelylikely that the result would have fallen betweenthe published result and one point higher.

• The bias from the necessary omission of thelate assessed schools seems certain to be lessthan one point, which is insignificantconsidering the sampling errors involved.

• All non-responding schools (i.e., late-testingschools plus true non-respondents) match theassessed sample well in terms of big city/smallcity/town/rural representation.The relatively low level of non-response

(i.e., Poland did not fall far below the acceptablestandard) indicates that no significant bias hasresulted from school non-response.

It was concluded that while the exclusion ofthe primary schools should be noted, data couldbe recommended for inclusion in the full rangeof PISA reports, provided the difference betweenthe target population in Poland and the inter-national target population was clearly noted.

Portugal

Portugal fully met the PISA standards, andinclusion in the full range of PISA reports wasrecommended.

Russian Federation

The adjudication noted four concerns about theimplementation of PISA 2000 in the RussianFederation. First, material was not translatedusing the standard procedures recommended inPISA, and the number of flawed items wasrelatively high in the main study. Second, in theinter-country reliability study, the agreementbetween national markers and internationalverifiers was 88.5 per cent, which is just belowthe 90 per cent required. Of the remaining 11.5per cent, 0.9 per cent of the ratings were foundto be too harsh, 7.0 per cent too lenient and 3.6per cent inconsistent. The Consortium discussedwith the Russian Federation national centre aboutthe possibility of re-scoring the open-ended itemsbut concluded that this would have a negligibleeffect on country results. In making this assessment,the Consortium carefully examined the psycho-

Dat

a ad

jud

icat

ion

189

metric properties of the Russian Federation itemsand did not identify an unusual number of item-by-country interactions. Third, the Student TrackingForms did not include students excluded withinthe school, meaning that it was not possible toestimate the total exclusion rate for the RussianFederation. Fourth, the correct procedure forsampling small schools was not followed whichled to a minor requirement to trim the weights,making small schools very slightly under-represented in the weighted data.

It was concluded that while these issues shouldbe noted, the data could be recommended forinclusion in the full range of PISA reports.

Spain

Spain did not follow the correct procedure forsampling small schools. This led to a need totrim the weights; small schools consequentlymay be slightly under-represented. This problemis not regarded as significant, and inclusion inthe full range of PISA reports was recommended.

Sweden

Sweden fully met the PISA standards, and inclusionin the full range of PISA reports was recommended.

Switzerland

Switzerland included a national option to surveygrade 9 students and to over-sample them insome parts of the country. Thus, there wereexplicit strata containing schools with grade 9,and other strata containing schools with 15-year-olds but no grade 9 students.

From the grade 9 schools, a sample of schoolswas selected in some strata within which asample of 50 students was selected from amongthose in grade 9, plus any other 15-year-olds.

In other strata, where stratum-level estimatesfor grade 9 students were required from the PISAassessment, a relatively large sample of schoolswas selected. In a sub-sample of these schools(made via systematic selection), a sample of 50grade 9 students, plus 15-year-old students, wasselected. In the remaining schools, a sample of 35students was selected from among the grade 9students only (regardless of their age).

In the strata of schools with no grade 9, asample of schools was selected, from each of which

a sample of 35 15-year-old students was selected.Thus, the sample contained three groups of

students: those in grade 9 aged 15, those in grade 9not aged 15, and those aged 15 not in grade 9.Within some strata, the students in the third groupwere sampled at a lower rate (considering theschool and student-within-school sampling ratescombined) than students in grade 9. The PISAweighting procedure ensured that these differentialsampling rates were taken into account whenanalysing the 15-year-old samples, grade 9 andothers combined. This procedure meant thatunweighted analyses of PISA data in Switzerlandwould be very likely to lead to erroneousconclusions, as grade 9 students were over-represented in the sample.

This entire procedure was fully approved andcarried out correctly. However, there was a minorproblem with school non-response. In one stratumwith about 1 per cent of the PISA population inSwitzerland, 35 schools in the population had bothgrade 9 and other 15-year-old students. All 35schools were included in the grade 9 sample, anda sub-sample of two of the schools was selected toinclude other 15-year-old students in the sample aswell. The school response rate in this stratum wasvery low. One of the two schools that were toinclude all 15-year-olds participated, and only threeof the other 33 schools participated. This very lowresponse rate (with a big differential between thetwo parts of the sample) meant that the standardapproach to creating school non-responseadjustments would not have worked appropriately.Therefore, schools from this explicit stratum hadto be combined with those from another stratumto adjust school non-response. The effects on theoverall results for Switzerland are very minor, asthis stratum is so small.

The translation verifier noted that the Swiss-German version of the material underwent sub-stantial national revisions after the field trial(most of which were not documented), while thechanges introduced in the source versions andthe corrections suggested by the verifier wereoften overlooked. As a result, the Swiss-Germanversion differed significantly from the otherGerman versions of the material, and hadslightly more flawed items.

It was concluded that none of the deviationsfrom the PISA standards would noticeably affectthe results, and inclusion in the full range ofPISA reports was recommended.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

190

United Kingdom

The adjudication revealed three potentialproblems with data from the United Kingdom.First, schools in Wales were excluded. Second,the school response rate prior to replacement (61.3 per cent) was below the PISA 2000standard of 65 per cent—the response rate afterreplacements was 82.1 per cent. Further analysesof the potential for school non-response biaswere therefore conducted. Third, testadministration procedures in Scotland did notrespect the PISA requirements. Testadministrators were neither trained nor given anadequate script. This is considered to be a matterof serious concern, but Scotland has only a smallpercentage of 15-year-olds in the UnitedKingdom, and this violation did not seriouslyaffect the United Kingdom’s results. In addition,student sampling procedures were not correctlyimplemented in Scotland. The unapprovedprocedure implemented was dealt with and isvery unlikely to have biased the UnitedKingdom’s results, but it did increase the numberof absent students in the data.5

The potential influence of the exclusion ofschools in Wales was considered by comparingthe distribution of General Certificate ofSecondary Education (GCSE) results for Englandand Wales. The comparison showed that thedistribution of achievement in Wales isessentially the same as in England. Further, sinceWales has only 5.1 per cent of the UnitedKingdom’s population, while England has 82.4 per cent, it did not seem credible that thedistribution in Wales could be so different as toadd a noticeable bias.

The primary source of the problems with schoolresponse was in England, where the response ratebefore replacement was 58.8 per cent. The NPMtherefore carried out some regression analyses inorder to examine which factors were associatedwith school non-response (prior to replacement)in England. The possibility of bias in the initialsample of schools, before replacement of

Dat

a ad

jud

icat

ion

191

refusing schools, was examined by evaluatingindependent variables as predictors of schoolnon-response. These included educationaloutcomes of the schools at GCSE level (publicexaminations taken at age 16); and socio-economic status as measured by the percentageof pupils eligible for free school meals—a widelyaccepted measure of poverty in education studiesin England.

In addition, sample stratification variableswere included in the models. Since these wereused to make school non-response adjustments,the relevant question is whether GCSE level andthe percentage of students eligible for free mealsat school are associated with school responsewhen they are also included in the model.

The conclusions were that, in view of thesmall coefficients for the response status variableand their lack of statistical significance (p = 0.426 and 0.150, respectively), there was nosubstantial evidence of non-response bias interms of the GCSE average point score (inparticular), or of the percentage of studentseligible for free school meals. Participatingschools were estimated to have a mean GCSElevel only 0.734 units higher than those that didnot, and a percentage of students eligible for freeschool meals that was only 0.298 per cent lower.

A further potential problem, in addition to thethree discussed above, was that 11 schools (of 349)restricted the sample of PISA-eligible students insome way, generally by limiting it to those studentsin the lower of the two grades with PISA-eligiblestudents. However, weighting adjustments wereable to partially compensate for this.

It was concluded that while these issues shouldbe noted, the data could be recommended forinclusion in the full range of PISA reports.

United States

The before-replacement response rate in theUnites States was 56.4 per cent, and the after-replacement response rate was 70.3 per cent,which fell short of the PISA 2000 standards. TheNPM explored this matter extensively with theConsortium and provided data as evidence thatthe achieved sample was representative.

It was not possible to check for systematicdifferences in the assessment results as theUnited States had no assessment data from thenon-responding schools, but it was possible to

5 Scotland sampled 35 students from each school. Thefirst 30 students were supposed to be tested and the lastfive students were considered as replacement students ifone or more students could not attend the testingsession. The Consortium dealt with this deviation byregarding all sampled students who were not tested asabsentees/refusals.

check for systematic differences in the schoolcharacteristics data available for both theresponding and non-responding schools. It iswell known from past surveys in the UnitedStates that characteristics such as region, urbanstatus, percentage of minorities, public/privatestatus, etc., are correlated with achievement, sothat systematic differences in them are likely toindicate the presence of potential bias inassessment estimates.

The potential predictors for school non-participation evaluated in the additional analysisprovided by the United States are:• NAEP region (Northeast, Midwest, South,

West);• urban status (urban or non-urban area);• public/private status;• school type (public, Catholic, non-Catholic

religious, non-religious private, Department ofDefense);

• minority percentage category (less than 4 per cent, 4 to 18 per cent, 18 to 34 per cent, 34 to 69 per cent, higher than 69 per cent);

• percentage eligible for school lunch (less than13.85 per cent, 13.85 to 37.1 per cent, higherthan 37.1 per cent, missing);

• estimated number of 15-year-olds; and• school grade-span (junior high: low grade, 7th

and high grade, 9th; high school: low grade,9th; middle school: low grade, 4th and highgrade, 8th; combined school: all others).Each characteristic was chosen because it was

available for both participating and non-participating schools, and because in pastsurveys it was found to be linked to responsepropensity.

The analysis began with a series of logisticregression models that included NAEP region,urban status, and each of the remaining predictorsindividually in turn. Region and urban statuswere included in each model because they havehistorically been correlated with schoolparticipation and are generally used to define cellsfor non-response adjustments. The remainingpredictors were then evaluated with region andurban status already included in the model. Amodel with all predictors was also fitted.

In most models, the South and Northeastregion parameters were significant at the 0.05-level. The p-value for urban status wassignificant at the 0.10-level, except in the modelcontaining all predictors.

The other important predictors were percentageof minority students and percentage of studentseligible for free lunches. The schools with 34 to69 per cent minority students were signif-icantlymore likely to refuse to participate than schoolswith more than 69 per cent minority students (p= 0.0072). The 18 to 34 per cent minoritystudent category also had a relatively small p-value: 0.0857. Two of the categories ofpercentage of students eligible for free luncheshad p-values around 0.10. In the final modelthat included all predictors, the percentage ofminority students and percentage of studentseligible for free lunches proved to be the mostimportant predictors. These results areconfirmed in the model selection analysispresented below.

After fitting a logistic regression model withall potential predictors, three types of modelselection were performed to select significantpredictors: backward, forward, and stepwise.6

For this analysis, all three model-selectionprocedures chose the same four predictors: NAEPregion, urban status, percentage of minoritystudents, and percentage of students eligible forfree lunches. The parameter estimates, chi-squarestatistics and p-values for this final model areshown in Table 73.

The non-response analyses did find differencesin the distribution between respondents and non-respondents on several of the school characteristics.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

192

6 The dependent variable again was the school’s participationstatus. These techniques all select a set of predictor variablesfrom a larger set of candidates. In backward selection, amodel is fitted using all the candidate variables. Significancelevels are then checked for all predictors. If any level isgreater than the cut-off value (0.05 for this analysis), onepredictor is dropped from the model–the one with the greatestp-value (i.e., the least significant predictor). The procedurethen iteratively checks the remaining set of predictors,dropping the least powerful predictor if any are above the cut-off value. NAEP region and urban status were not allowed tobe dropped. The procedure stops when all remainingpredictors have p-values smaller than the cut-off value.Forward selection works in a similar fashion, but starts withonly the intercept term and any effects forced into the model.It then adds to the model the predictor with the next highestlevel of significance, as computed by the score chi-squaredstatistic. The process continues until no effect meets therequired level of significance. An effect is never removedonce it has entered the model. Stepwise selection begins like forward selection, but theeffects do not necessarily remain in the model once they areentered. A forward selection step may be followed by abackwards elimination step if an effect is no longersignificant in the company of a newly-entered effect.

As expected, region and urban status werefound to be predictors of response status. Inaddition, the percentage of minority studentsand percentage of students eligible for freelunches were also important predictors. Sincethey are often correlated with achievement,assuming that school non-response is a randomoccurrence could lead to bias in assessmentestimates.

The school non-response adjustment hasincorporated these two additional predictorsalong with region and urban status in definingthe adjustment cells. Defining the cells in thisway distributes the weight of non-participatingschools to ‘similar’ (in terms of characteristicsknown for all schools) participating schools,which reduces the potential bias due to lack ofassessment in non-participating schools.However, there is no way to completelyeliminate all potential non-response bias becauseassessment information is not available fromschools that do not participate.

There is also a strong, but highly non-linear,relationship with minority (black and Hispanic)enrolment. Schools with relatively high andrelatively low minority enrolments wereconsiderably more likely to participate thanthose with intermediate levels of minority

enrolment. The implications for the direction ofthe bias are not clear. Other characteristicsexamined do not show a discernible relationship.

Many of the 145 participating schoolsrestricted the sample of PISA-eligible students insome way, generally by limiting it to studentswithin the modal grade. Weighting adjustmentswere made to partially compensate for this.Schools that surveyed the modal grade (10) weretreated as respondents in calculating theweighted response rates, while those that did notpermit testing of grade 10 students were handledas non-respondents for calculating schoolresponse rates, but their student data wereincluded in the database.

It was concluded that while these issuesshould be noted, the data could berecommended for inclusion in the full range ofPISA reports.

Dat

a ad

jud

icat

ion

193

Table 73: Results for the United States Logistic Regression

Parameter Estimate Chi-Squared p-value

Northeast region 0.4890 2.7623 0.0965

Midwest region 0.3213 1.1957 0.2742

South region -0.7919 4.7966 0.0285

(West region) - - -

Urban area -0.3644 2.6241 0.1053

(Non-urban area) - - -

Less than 4% minority -0.6871 2.7699 0.0960

4-18% minority -0.2607 0.5384 0.4631

18 -34% minority 0.6348 3.8064 0.0511

34-69% minority 1.3001 13.5852 0.0002

(Greater than 69% minority) - - -

Missing free lunch eligibility 0.5443 3.1671 0.0751

≤ 13.85% free lunch eligibility 0.6531 3.6340 0.0566

13.85-37.1% free lunch eligibility -1.0559 10.329 0.0013

PROFICIENCY SCALESConstruction

Ross Turner

Introduction

PISA seeks to report outcomes in terms ofproficiency scales that are based on scientifictheory and that are interpretable in policy terms.There are two further considerations for thedevelopment of the scales and levels:• PISA must provide a single score for each

country for each of the three domains. It isalso recognised that multiple scales might beuseful for certain purposes, and thedevelopment of these has been consideredalongside the need for a single scale.

• The proficiency descriptions must shed light ontrends over time. The amount of data availableto support the detailed development of profic-iency descriptions varies, depending on whethera particular domain is a ‘major’ or ‘minor’domain for any particular survey cycle. Decisionsabout scale development need to recognise thisvariation, and must facilitate the descriptionof any changes in proficiency levels achieved bycountries from one survey cycle to the next.Development of a method for describing

proficiency in PISA reading, mathematical andscientific literacy took place over more than ayear, in order to prepare for the reporting ofoutcomes of the PISA 2000 surveys. The threeFunctional Expert Groups (FEGs) (for reading,mathematics and science) worked with theConsortium to develop sets of describedproficiency scales for the three PISA 2000 testdomains. Consultations of the Board ofParticipating Countries (BPC), National ProjectManagers (NPMs) and the PISA TechnicalAdvisory Group (TAG) took place over several

stages. The sets of described scales were presentedto the BPC for approval in April 2001. Thischapter presents the outcomes of this process.

Development of theDescribed Scales

The development of described proficiency scalesfor PISA was carried out through a processinvolving a number of stages. The stages aredescribed here in a linear fashion, but in realitythe development process involved somebackwards and forwards movement where stageswere revisited and descriptions wereprogressively refined. Essentially the samedevelopment process was used in each of thethree domains.

Stage 1: Identifying Possible Sub-scales

The first stage in the process involved theexperts in each domain articulating possiblereporting scales (dimensions) for the domain.

For reading, two main options were activelyconsidered. These were scales based on twopossible groupings of the five ‘aspects’ of reading(retrieving information; forming a broadunderstanding; developing an interpretation;reflecting on content; and reflecting on the formof a text).1

In the case of mathematics, a singleproficiency scale was proposed, though the

section five: scale construction and data products

1 While strictly speaking the scales based on aspects ofreading are sub-scales of the combined readingliteracy scale, for simplicity they are mostly referred toas ‘scales’ rather than ‘sub-scales’ in this report.

Chapter 16

possibility of reporting according to the ‘bigideas’ or the ‘competency classes’ described inthe PISA mathematics framework was alsoconsidered. This option is likely to be furtherexplored in PISA 2003 when mathematics is themajor domain.

For science, a single overall proficiency scalewas proposed by the FEG. There was interest inconsidering two sub-scales, for ‘scientificknowledge’ and ‘scientific processes’, but thesmall number of items in PISA 2000, whenscience was a minor domain, meant that this wasnot possible. Instead, these aspects were builtinto descriptions of a single scale.

Wherever multiple scales were under consider-ation, they arose clearly from the framework forthe domain, they were seen to be meaningful andpotentially useful for feedback and reportingpurposes, and they needed to be defensible withrespect to their measurement properties.

Because of the longitudinal nature of the PISAproject, the decision about the number and natureof reporting scales had to take into account thefact that in some test cycles a domain will betreated as ‘minor’ and in other cycles as ‘major’.The amount of data available to support thedevelopment of described proficiency scales willvary from cycle to cycle for each domain, but theBPC expects proficiency scales that can becompared across cycles.

Stage 2: Assigning Items to Scales

The second stage in the process was to reviewthe association of each item in the main studywith each of the scales under consideration. Thiswas done with the involvement of the subject-matter FEGs, the test developers and Consortiumstaff, and other selected experts (for example,some NPMs were involved). The statisticalanalysis of item scores from the field trial wasalso useful in identifying the degree to whichitems allocated to each sub-scale fitted withinthat scale, and in validating the work of thedomain experts.

Stage 3: Skills Audit

The next stage involved a detailed expertanalysis of each item, and in the case of itemswith partial credit, for each score step within theitem, in relation to the definition of the relevant

sub-scale from the domain framework. The skillsand knowledge required to achieve each scorestep were identified and described.

This stage involved negotiation and discussionamong the interested players, circulation of draftmaterial, and progressive refinement of drafts onthe basis of expert input and feedback.

Stage 4: Analysing Field Trial Data

For each set of scales being considered, the fieldtrial data for those items that were subsequentlyselected for the main study were analysed usingitem response techniques to derive difficultyestimates for each achievement threshold foreach item in each sub-scale.

Many items had a single achievementthreshold (associated with getting the item rightrather than wrong). Where partial credit wasavailable, more than one achievement thresholdcould be calculated (achieving a score of one ormore rather than zero, two or more rather thanone, etc.).

Within each sub-scale, achievement thresholdswere placed along a difficulty continuum linkeddirectly to student abilities. This analysis givesan indication of the utility of each scale from ameasurement perspective (for examples, seeFigures 22, 23 and 24).

Stage 5: Defining the Dimensions

The information from the domain-specific expertanalysis (Stage 3) and the statistical analysis(Stage 4) was combined. For each set of scalesbeing considered, the item score steps wereordered according to the size of their associatedthresholds and then linked with the descriptionsof associated knowledge and skills, giving ahierarchy of knowledge and skills that definedthe dimension. Natural clusters of skills werefound using this approach, which provided abasis for understanding each dimension anddescribing proficiency levels.

Stage 6: Revising and Refining MainStudy Data

When the main study data became available, theinformation arising from the statistical analysisabout the relative difficulty of item thresholdswas updated. This enabled a review and revision

PIS

A 2

00

0 t

ech

nic

al r

epor

t

196

of Stage 5 by the working groups, FEGs andother interested parties. The preliminarydescriptions and levels were then reviewed andrevised in the light of further technicalinformation that was provided by the TechnicalAdvisory Group, and an approach to defininglevels and associating students with those levelsthat had arisen from discussion between theConsortium and the OECD Secretariat wasimplemented.

Stage 7: Validating

Two major approaches to validation were thenconsidered and used to varying degrees by thethree working groups. One method was toprovide knowledgeable experts (e.g., teachers, or members of the subject matter expertgroups) with material that enabled them tojudge PISA items against the described levels,or against a set of indicators that underpinnedthe described levels. Some use of such aprocess was made, and further validationexercises of this kind may be used in thefuture. Second, the described scales weresubjected to an extensive consultation processinvolving all PISA countries through theirNPMs This approach to validation rests on theextent to which users of the described scalesfind them informative.

Defining ProficiencyLevels

Developing described proficiency levelsprogressed in two broad phases. The first, whichcame after the development of the describedscales, was based on a substantive analysis ofPISA items in relation to the aspects of literacythat underpinned each test domain. Thisproduced proficiency levels that reflectedobservations of student performance and adetailed analysis of the cognitive demands ofPISA items. The second phase involved decisionsabout where to set cut-off points for levels andhow to associate students with each level. This isboth a technical and very practical matter ofinterpreting what it means to ‘be at a level’, andhas very significant consequences for reportingnational and international results.

Several principles were considered fordeveloping and establishing a useful meaning for

‘being at a level’, and therefore for determiningan approach to locating cut-off points betweenlevels and associating students with them:• A ‘common understanding’ of the meaning of

levels should be developed and promoted.First, it is important to understand that theliteracy skills measured in PISA must beconsidered as continua: there are no naturalbreaking points to mark borderlines betweenstages along these continua. Dividing each ofthese continua into levels, though useful forcommunication about students’ development,is essentially arbitrary. Like the definition ofunits on, for example, a scale of length, thereis no fundamental difference between 1 metreand 1.5 metres—it is a matter of degree. It isuseful, however, to define stages, or levelsalong the continua, because they enable us tocommunicate about the proficiency ofstudents in terms other than numbers. Theapproach adopted for PISA 2000 was that itwould only be useful to regard students ashaving attained a particular level if this wouldmean that we can have certain expectationsabout what these students are capable of ingeneral when they are said to be at that level.It was decided that this expectation wouldhave to mean at a minimum that students at aparticular level would be more likely to solvetasks at that level than to fail them. Byimplication, it must be expected that theywould get at least half of the items correct ona test composed of items uniformly spreadacross that level, which is useful in helping tointerpret the proficiency of students atdifferent points across the proficiency rangedefined at each level.

• For example, students at the bottom of a levelwould complete at least 50 per cent of taskscorrectly on a test set at the level, while studentsat the middle and top of each level would beexpected to achieve a much higher successrate. At the top end of the bandwidth of alevel would be the students who are ‘masters’of that level. These students would be likelyto solve about 80 per cent of the tasks at thatlevel. But, being at the top border of thatlevel, they would also be at the bottom borderof the next level up, where according to thereasoning here they should have a likelihoodof at least 50 per cent of solving any tasksdefined to be at that higher level.

Pro

fici

ency

sca

les

con

stru

ctio

n

197

P=?

P=?

P=?

P=?

Student at Top of Level

Student at Bottom of Level

Item at Top of Level

Item at Bottom of Level

• Further, the meaning of being at a level for agiven scale should be more or less consistentfor each level. In other words, to the extentpossible within the substantively baseddefinition and description of levels, cut-offpoints should create levels of more or lessconstant breadth. Some small variation maybe appropriate, but in order for interpretationand definition of cut-off points and levels tobe consistent, the levels have to be aboutequally broad. Clearly this would not apply tothe highest and lowest proficiency levels,which are unbounded.

• A more or less consistent approach should betaken to defining levels for the different scales.Their breadth may not be exactly the same foreach proficiency scale, but the same kind ofinterpretation should be possible for eachscale that is developed.A way of implementing these principles was

developed. This method links the two variablesmentioned in the preceding paragraphs, and athird related variable. The three variables can beexpressed as follows:• the expected success of a student at a

particular level on a test containing items atthat level (proposed to be set at a minimumthat is near 50 per cent for the student at thebottom of the level, and higher for otherstudents in the level);

• the width of the levels in that scale(determined largely by substantive

considerations of the cognitive demands ofitems at the level and observations of studentperformance on the items); and

• the probability that a student in the middle ofa level would correctly answer an item ofaverage difficulty for that level (in fact, theprobability that a student at any particularlevel would get an item at the same levelcorrect), sometimes referred to as the ‘RP-value’for the scale (where ‘RP’ indicates ‘responseprobability’).Figure 31 summarises the relationship among

these three mathematically linked variables. Itshows a vertical line representing a part of thescale being defined, one of the bounded levels onthe scale, a student at both the top and thebottom of the level, and reference to an item atthe top and an item at the bottom of the level.Dotted lines connecting the students and itemsare labelled P=? to indicate the probabilityassociated with that student correctly respondingto that item.

PISA 2000 implemented the followingsolution: start with the substantively determinedrange of abilities for each bounded level in eachscale (the desired band breadth); then determinethe highest possible RP value that will becommon across domains—that would give effectto the broad interpretation of the meaning of‘being at a level’ (an expectation of correctlyresponding to a minimum of 50 per cent of theitems in a test at that level).

PIS

A 2

00

0 t

ech

nic

al r

epor

t

198

Figure 31: What it Means to be at a Level

After doing this, the exact average percentageof correct answers on a test composed of itemsat a level could vary slightly among the differentdomains, but will always be at least 50 per centat the bottom of the level.

The highest and lowest described levels areunbounded. For a certain high point on the scaleand below a certain low point, the proficiencydescriptions could, arguably, cease to beapplicable. At the high end of the scale, this isnot such a problem since extremely proficientstudents could reasonably be assumed to becapable of at least the achievements described forthe highest level. At the other end of the scale,however, the same argument does not hold. Alower limit therefore needs to be determined forthe lowest described level, below which nomeaningful description of proficiency is possible.

As levels 2, 3 and 4 (within a domain) will beequally broad, it was proposed that the floor ofthe lowest described level be placed at thisbreadth below the upper boundary of level 1(that is, the cut-off between levels 1 and 2).Student performance below this level is lowerthan that which PISA can reliably assess and,more importantly, describe.

Reading Literacy

Scope

The purpose of the PISA reading literacyassessment is to monitor and report on thereading proficiency of 15-year-olds as theyapproach the end of compulsory schooling. Eachtask in the assessment has been designed togather a specific piece of evidence about readingproficiency by simulating a reading activity thata reader might experience in or outside school,as an adolescent, or in adult life.

PISA reading tasks range from verystraightforward comprehension activities to quitesophisticated activities requiring deep andmultiple levels of understanding. Readingproficiency is characterised by organising thetasks into five levels of increasing proficiency,with Level 1 describing what is required torespond successfully to the most basic tasks, andLevel 5 describing what is required to respondsuccessfully to the most demanding tasks.

Three Reading Proficiency Scales

Why aspect scales?The scales represent three major reading aspectsor purposes, which are all essential parts oftypical reading in contemporary developedsocieties: retrieving information from a variety ofreading materials; interpreting what is read; andreflecting upon and evaluating what is read.

People often have practical reasons forretrieving information from reading material, andthe tasks can range from locating details requiredby an employer from a job advertisement tofinding a telephone number with several prefixes.

Interpreting texts involves processing a text tomake internal sense of it. This includes a widevariety of cognitive activities, all of which involvesome degree of inference. For example, a taskmay involve connecting two parts of a text, pro-cessing the text to summarise the main ideas, orfinding a specific instance of something describedearlier in general terms. When interpreting, areader is identifying the underlying assumptionsor implications of part or all of a text.

Reflection and evaluation involves drawing onknowledge, ideas, or attitudes that are externalto the text in order to relate the new informationthat it provides to one’s own conceptual andexperiential frames of reference.

Dividing reading into different aspects is not theonly possible way to divide the task of reading. Texttypes or formats could be used for organising thedomain, as earlier large-scale studies, includingthe International Association for the Evaluationof Educational Achievement (IEA) ReadingLiteracy Study (narrative, exposition anddocument) and the International Adult LiteracySurvey (IALS, prose and document) did.However, using aspects seems to reflect mostclosely the policy objectives of PISA, and thehope is that the provision of these scales willoffer a different perspective to understandinghow reading proficiency develops.

Why three scales rather than five?The three scales are based on the set of fiveaspect variables described in the PISA readingliteracy framework (OECD, 1999a): retrievinginformation; forming a broad understanding;developing an interpretation; reflecting on thecontent of a text; and reflecting on the form of atext. These five aspects were defined primarily toensure that the PISA framework and the

Pro

fici

ency

sca

les

con

stru

ctio

n

199

collection of items developed according to thisframework would provide an optimal coverage ofthe domain of reading literacy as it was definedby the experts. For reporting purposes and forcommunicating the results, a more parsimoniousmodel was thought to be more appropriate.Collapsing the five-aspect framework on to threeaspect scales is justified by the high correlationsbetween the sets of items. Consequently ‘developingan interpretation’ and ‘forming a broad under-standing’ have been grouped together becauseinformation provided in the text is processed bythe reader in some ways. In the case of ‘forminga broad understanding’, this occurs in the wholetext, and in the case of ‘developing an inter-pretation’, this occurs in one part of the text inrelation to another. ‘Reflecting on the content ofa text’ and ‘reflecting on the form of a text’ havebeen collapsed into a single reflection andevaluation sub-scale because the distinctionbetween reflecting on form and reflecting oncontent, in practice, was not found to be notice-able in the data as they were collected in PISA.

Even these three scales overlap considerably. Inpractice, most tasks make many different demandson readers. The three aspects are conceived asinterrelated and interdependent, and assigning atask to one or another of the scales is often a matterof fine discrimination about its salient features,and about the approach typically taken to it.Empirical evidence from the main study data wasused to validate these judgements. Despite theinterdependence of the three scales, they revealinteresting and useful distinctions between countriesand among sub-groups within the countries.

Pragmatic and technical reasons alsodetermined the use of only three aspects todescribe proficiency scales. In 2003 and 2006,reading will be a minor domain and will thereforebe restricted to about 30 items, which would beinsufficient for reporting over five scales.

Task Variables

In developing descriptions of the conditions andfeatures of tasks along each of the three scales,distinct, intersecting sets of variables associatedwith task difficulty seemed to be operating.These variables are based on the judgements byreading experts of the items that fall within eachlevel band rather than on any systematic analysis,although systematic analyses will be made later.

The PISA definition of reading literacy builds,in part, on the IEA Reading Literacy Study (Elley,1992), but also on IALS. It reflects the emphasisof the latter study on the importance of readingskills in active and critical participation insociety. It was also influenced by current theorieswhich emphasise the interactive nature of reading(Dechant, 1991; McCormick, 1988; Rumelhart,1985), on models of discourse comprehension(Graesser, Millis and Zwaan, 1997; Kintsch andvan Dijk, 1978; Van Dijk and Kintsch, 1983),and on theories of performance in solvingreading tasks (Kirsch and Mosenthal, 1990). Thedefinition of variables operating within thereflection and evaluation sub-scale is, we believe,PISA’s major new contribution to conceptualisingreading. PISA is the first large-scale study toattempt to define variables operating within thereflection and evaluation sub-scale and topropose characterising populations from thisperspective in its reporting.

The apparently salient variables in each scaleare described briefly below. It is important toremember that the difficulty of any particularitem is conditioned by the interaction among allthe variables.

Task Variables for the combinedReading Literacy Scale

The overall proficiency scale for reading literacycovers three broad aspects of reading: retrieving,interpreting and reflecting upon and evaluatinginformation. The proficiency scale includes tasksassociated with each of these.

The difficulty of any reading task depends onan interaction between several variables. The typeof process involved in retrieving, interpreting, orreflecting on and evaluating information is salientin relation to difficulty. The complexity andsophistication of processes range from makingsimple connections between pieces of information,to categorising ideas according to given criteria,to critically evaluating a section of text. Thedifficulty of retrieval tasks is associated particularlywith the number of pieces of information to beincluded in the response, the number of criteriathat the information must meet, and whetherwhat is retrieved needs to be sequenced in aparticular way. For interpretative and reflectivetasks, the length, complexity and amount of textthat needs to be assimilated particularly affect

PIS

A 2

00

0 t

ech

nic

al r

epor

t

200

difficulty. The difficulty of items requiringreflection is also conditioned by the familiarity orspecificity of the knowledge that must be drawnon from outside the text. For all aspects ofreading, the tasks tend to be more difficult whenthere is no explicit mention of the ideas orinformation required to complete them, therequired information is not prominent in the text,and much competing information is present.

Task Variables for the RetrievingInformation Sub-Scale

Four important variables characterising the PISAretrieving information tasks were identified.They are outlined here and are incorporated inthe proficiency scale description in Figure 32.

Type of retrievalType of retrieval is related to the type of matchmicro-aspect referred to in the readingframework for PISA (OECD, 1999a). Locating isthe primary process used for tasks in this scale.Locating tasks require a reader to find andretrieve information in a text using criteria thatare specified in a question or directive. The morecriteria that need to be taken into account, themore the task will tend to be difficult. Thereader may need to consult a text several timesby cycling through it to find and retrievedifferent pieces of information. The morenumerous the pieces of information to beretrieved, and whether or not they must besequenced in a particular way, the greater thedifficulty of the task.

Explicitness of informationExplicitness of information refers to the degreeof literalness or explicitness in the task and inthe text, and considers how much the reader mustuse inference to find the necessary information.The task becomes more difficult when the readermust infer what needs to be considered in lookingfor information, and when the information to beretrieved is not explicitly provided. This variableis also related to the type of match micro-aspectreferred to in the reading framework for PISA.While retrieving information tasks generally requireminimal inference, the more difficult tasks requiresome processing to match the requisite informationwith what is in the text, and to select relevantinformation through inference and prioritising.

Nature of competing informationNature of competing information is similar tothe plausibility of the distractors micro-aspectreferred to in the reading framework for PISA.Information is competing when the reader maymistakenly retrieve it because it resembles thecorrect information in one or more ways. Withinthis variable, the prominence of the correctinformation also needs to be considered: it willbe relatively easy to locate if it is in a heading,located near the beginning of the text, orrepeated several times.

Nature of textNature of text refers to aspects such as text lengthand complexity. Longer and more complex textsare harder to negotiate than simpler and shortertexts, all other things being equal. Althoughnature of text is a variable in all scales, itscharacteristics as a variable and probably itsimportance differ in retrieving information and inthe other two scales. This is probably becauseretrieving information tends to focus more onsmaller details of text than interpreting texts andreflection and evaluation do. The reader thereforeneeds to take comparatively little account of theoverall context (the text) in which the informationis located, especially if the text is well structuredand if the question or directive refers to it.

Task Variables for the InterpretingTexts Sub-Scale

A similar set of variables to those for retrievinginformation was identified for the PISAinterpreting texts tasks. In some respects, thevariables behave differently in relation to theinterpreting texts sub-scale, as outlined here andincorporated in the scale description in Figure33.

Type of interpretationType of interpretation is related to both the typeof match and the type of information micro-aspects referred to in the reading framework forPISA. Several processes have been identified thatform a hierarchy of complexity, and tend to beassociated with increasing difficulty. At theeasier end, readers need to identify a theme ormain idea. More difficult tasks requireunderstanding relationships within the text thatare an inherent part of its organisation and

Pro

fici

ency

sca

les

con

stru

ctio

n

201

meaning such as cause and effect, problem andsolution, goal and action, claim and evidence,motive and behaviour, precondition and action,explanation for an outcome, and time sequence.The most difficult tasks are of two kinds:construing meaning, which requires the reader tofocus on a word, a phrase or sentence, in orderto identify its particular effect in context andmay involve dealing with ambiguities or subtlenuances of meaning; and analogical reasoning,which requires comparing (finding similaritiesbetween parts of a text), contrasting (findingdifferences), or categorising ideas (identifyingsimilarities and differences in order to classifyinformation from within or relevant to the text).The latter may require either fitting examplesinto a category supplied in the text or findingtextual examples of a given category.

Explicitness of informationExplicitness of information encompasses howmuch direction the reader is given in focusingthe interpretation appropriately. The factors tobe considered may be indicated explicitly in theitem, and readily matched to one or more partsof the text. For a more difficult task, readersmay have to consider factors that are not statedeither in the item or the text. Generally,interpreting tasks require more inference thanthose in the retrieving information sub-scale, butthere is a wide range of demand within thisvariable in the interpreting texts sub-scale.

The difficulty of a task is also conditioned bythe number of factors that need to be considered,and the number of pieces of information that thereader needs to respond to the item. The greaterthe number of factors to consider or pieces ofinformation to be supplied, the more difficult thetask becomes.

Nature of competing informationNature of competing information functionssimilarly to the variable with the same name inthe retrieving information sub-scale. Forinterpreting items, however, explicit informationis likely to compete with the implicit informationrequired for the task.

Nature of textNature of text is relevant to the interpretingtexts sub-scale in several ways. The familiarity ofthe topic or theme of the text is significant. Textswith more familiar topics and more personalthemes tend to be easier to assimilate than thosewith themes that are remote from the reader’sexperience and that are more public orimpersonal. Similarly, texts in a familiar form orgenre are easier to manage than those inunfamiliar forms. The length and complexity ofa text (or the part of it that needs to beconsidered) also play a role in the difficulty of atask. The more text the reader has to take intoaccount, the more difficult a task is likely to be.

Task Variables for the Reflection andevaluation Sub-Scale

Some of the variables identified as relevant toretrieving information and interpreting texts arealso relevant to the PISA reflection and evaluationtasks, though operating differently to some extent.The reflection and evaluation dimension, however,is influenced by a wider range of task variablesthan the other reading aspects, as itemised here andincorporated in the scale description in Figure 34.

Type of reflectionFive reflecting processes were identified. Thereflection and evaluation sub-scale associates eachtype of reflection with a range of difficulties, buton average, the types of reflection vary incomplexity and tend to be associated withincreasing difficulty.

Connecting is at the lowest level of reflection.A connection requires the reader to make a basiclink between the text and knowledge fromoutside it. Connecting may involvedemonstrating an understanding of a constructor concept underlying the text. It may be amatter of content, such as finding an example ofa concept (e.g., ‘fact’ or ‘kindness’), or mayinvolve demonstrating an understanding of theform or linguistic function of features of a text,such as recognising the purpose of conventionalstructures or linguistic features. For example, thetask may require the reader to articulate therelationship between two parts of a text.

The following two types of reflection seem tobe more difficult than connecting, but werefound to be of similar difficulty.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

202

Explaining involves going beyond the text to givereasons for the presence or purpose of text-basedinformation or features that are consistent with theevidence presented in the text. For example, a readermay be asked to explain why there is a questionon an employment application form about how fara prospective employee lives from the workplace.

When comparing, the reader needs to findsimilarities (or differences) between something inthe text and something from outside it. In itemsdescribed as involving comparing, for example,readers may be asked to compare a character’sbehaviour in a narrative with the behaviour ofpeople they know; or to say whether their ownattitude to life resembles that expressed in apoem. Comparisons become more demanding asthe range of approaches from which thecomparison can be made is more constricted andless explicitly stated, and as the knowledge to bedrawn upon becomes less familiar and accessible.

The most difficult tasks involve eitherhypothesising or evaluating.

Hypothesising involves going beyond the textto offer explanations for text-based informationor features, which are consistent with the evidencepresented in the text. This type of reflection couldbe termed simply ‘explaining’ if the context is veryfamiliar to the reader, or if some supporting infor-mation were supplied in the text. However, hypo-thesising tends to be relatively demanding becausethe reader needs to synthesise several pieces of infor-mation from both inside and outside the text, andto generate plausible conclusions rather than toretrieve familiar knowledge and apply it to thetext. The degree to which signals are given aboutwhat kind of knowledge needs to be drawn on,and the degree of familiarity or concreteness ofthe knowledge, contribute to the difficulty ofhypothesising tasks.

Evaluating involves making a judgement abouta text as a whole or about some part or feature ofit. This kind of critical thinking requires the readerto call upon an internalised hierarchy of values(good/bad; appropriate/inappropriate; right/wrong).In PISA items, the process of evaluating is oftenrelated to textual structure, style, or internalcoherence. The reader may need to considercritically the writer’s point of view or the text’sstylistic appropriateness, logical consistency, orthematic coherence.

Nature of reader’s knowledgeThe nature of reader’s knowledge brought to

bear from outside the text, on a continuum fromgeneral (broad, diverse and commonly known)to specialised (narrow, specific and related to aparticular domain), is important in reflectingtasks. The term ‘knowledge’ includes attitudes,beliefs and opinions as well as factualknowledge—any cognitive phenomenon that thereader may draw on or possess. The domain andreference point for the knowledge andexperience on which the reader needs to draw toreflect upon the text contribute to the difficultyof the task. When the knowledge must be highlypersonal and subjective, there is a lower demandbecause there is a very wide range of appropriateresponses and no strong external standardagainst which to judge them. As the knowledgeto be drawn upon becomes more specialised, lesspersonal and more externally referenced, therange of appropriate responses narrows. Forexample, if knowledge about literary styles orinformation about public events is a necessaryprerequisite for a sensible reflection, the task islikely to be more difficult than a task that asksfor a personal opinion about parental behaviour.

In addition, difficulty is affected by the extentto which the reader has to generate the terms ofthe reflection. In some cases, the item or the textfully define what needs to be considered in thecourse of the reflection. In other cases, readersmust infer or generate their own set of factors asthe basis for an hypothesis or evaluation.

Nature of textNature of text resembles the variable describedfor the interpreting texts sub-scale, above. Reflectingupon items that are based on simple, short texts(or simple short parts of longer texts) are easierthan those based on longer, more complex texts.The degree of textual complexity is judged interms of its content, linguistic form, or structure.

Nature of understanding of the textNature of understanding of the text is close tothe ‘type of match’ micro-process described inthe reading framework. It encompasses the kindof processing the reader needs to engage inbefore reflecting on the text. If the reflection isto be based on locating a small section ofexplicitly indicated text (locating), the task willtend to be easier than if it is based on acontradiction in the text that must first beinferred (inferring a logical relationship). Thetype of understanding of the text may entail P

rofi

cien

cy s

cale

s co

nst

ruct

ion

203

2 Tasks at this level require the reader to locate oneor more pieces of information, which may need tomeet multiple criteria. Typically some competinginformation is present.

• Finds explicitly stated information in a notice about animmunisation program in the work place. The relevantinformation is near the beginning of the notice butembedded in a complex sentence. There is significantcompeting information. [Flu 2 R077Q02]

partial or a broad, general understanding of themain idea. Reflecting items become moredifficult as the reflection requires fuller, deeperand richer textual understanding.

Retrieving Information Sub-Scale

The PISA 2000 retrieving information sub-scaleis described in Figure 32.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

204

4 Tasks at this level require the reader to locate andpossibly sequence or combine multiple pieces ofembedded information. Each piece of informationmay need to meet multiple criteria. Typically thecontent and form of the text are unfamiliar. Thereader may need to make inferences in order todetermine which information in the text is relevantto the task.

• Marks the positions of two actors on a stage diagram byreferring to a play script. The two pieces of informationneeded are embedded in a stage direction in a long anddense text. [Amanda 4 R216Q04]

3 Tasks at this level require the reader to locate andin some cases recognise the links between pieces ofinformation, each of which may be required tomeet multiple criteria. Typically there is prominentcompeting information.

• Identifies the starting date of a graph showing the depthof a lake over several thousand years. The informationis given in prose in the introduction, but the date is notexplicitly marked on the graph’s scale. The first markeddate on the scale constitutes strong competinginformation. [Chad 3A R040Q03A]

• Finds the relevant section of a magazine article foryoung people about DNA testing by matching with aterm in the question. The precise information that isrequired is surrounded by a good deal of competinginformation, some of which appears to be contradictory.[Police 4 R100Q04]

1 Tasks at this level require the reader to locate oneor more independent pieces of explicitly statedinformation. Typically there is a single criterionthat the located information must meet and thereis little if any competing information in the text.

Below Insufficient information to describe features ofLevel 1 tasks at this level.

• Finds a literal match between a term in the question andthe required information. The text is a long narrative;however, the reader is specifically directed to therelevant passage which is near the beginning of thestory. [Gift 6 R119Q06]

• Locates a single explicitly stated piece of information ina notice about job services. The required information issignalled by a heading in the text that literally matches aterm in the question. [Personnel 1 R234Q01]

Figure 32: The Retrieving Information Sub-Scale

5 Tasks at this level require the reader to locate andpossibly sequence or combine multiple pieces ofdeeply embedded information, some of which maybe outside the main body of the text. Typically thecontent and form of the text are unfamiliar. Thereader needs to make high-level inferences in orderto determine which information in the text isrelevant to the task. There is highly plausibleand/or extensive competing information.

• Locates and combines two pieces of numericinformation on a diagram showing the structure of thelabour force. One of the relevant pieces of informationis found in a footnote to the caption of the diagram.[Labour 3 R088Q03, score category 2]

Distinguishing features of tasks at each level:

In a typical task at this level, the reader:

Pro

fici

ency

sca

les

con

stru

ctio

n

205

Interpreting Texts Sub-Scale

The PISA 2000 interpreting texts sub-scale isdescribed in Figure 33.

5 Tasks at this level typically require eitherconstruing of nuanced language or a full anddetailed understanding of a text.

• Analyses the descriptions of several cases in order todetermine the appropriate labour force status categories.Infers the criteria for assignment of each case from a fullunderstanding of the structure and content of a treediagram. Some of the relevant information is infootnotes and is therefore not prominent. [Labour 4R088Q04, score category 2]

4 Tasks at this level typically involve eitherunderstanding and applying categories in anunfamiliar context; or construing the meaning ofsections of text by taking into account the text as awhole. These tasks require a high level of text-based inference. Competing information at thislevel is typically in the form of ambiguities, ideasthat are contrary to expectation, or ideas that arenegatively worded.

• Construes the meaning of a sentence in context bytaking into account information across a large section oftext. The sentence in isolation is ambiguous and thereare apparently plausible alternative readings. [Gift 4R119Q04]



3 Tasks at this level require the reader to considermany elements of the text. The reader typically needsto integrate several parts of a text in order to identifya main idea, understand a relationship, or construethe meaning of a word or phrase. The reader mayneed to take many criteria into account whencomparing, contrasting or categorising. Often therequired information is not prominent but implicitin the text, and the reader may be distracted bycompeting information that is explicit in the text.

• Explains a character’s behaviour in a given situation bylinking a chain of events and descriptions scatteredthrough a long story. [Gift 8 R119Q08]

• Integrates information from two graphic displays toinfer the concurrence of two events. The two graphicsuse different conventions, and the reader must interpretthe structure of both graphics in order to translate therelevant information from one form to the other. [Chad6 R040Q06]

2 Some tasks require inferring the main idea in a textwhen the information is not prominent. Othersrequire understanding relationships or construingmeaning within a limited part of the text, bymaking low-level inferences. Tasks at this levelmay involve forming or applying simple categories.

• Uses information in a brief introduction to infer themain theme of an extract from a play script. [Amanda 1R216Q01]

1 Tasks at this level typically require inferring themain theme or author’s purpose in a text about afamiliar topic when the idea is prominent, eitherbecause it is repeated or because it appears early inthe text. There is little if any competinginformation in the text.

• Recognises the main theme of a magazine article forteenagers about sports shoes. The theme is implied inthe sub-heading and repeated several times in the bodyof the article. [Runners 1 R110Q01]


Figure 33: The Interpreting Texts Sub-Scale

2 Some tasks at this level require readers to makeconnections or comparisons between the text andoutside knowledge. For others, readers need todraw on personal experience and attitudes toexplain a feature of the text. The tasks require abroad understanding of the text.

• Compares claims made in two short texts (letters aboutgraffiti) with his/her own views and attitudes. Broadunderstanding of at least one of the two letter writers’opinions needs to be demonstrated. [Graffiti 6AR081Q06A]

PIS

A 2

00

0 t

ech

nic

al r

epor

t

206

Reflection and Evaluation Sub-Scale

The PISA 2000 reflection and evaluation sub-scale is described in Figure 34.



5 Tasks at this level require critical evaluation orhypothesis, and may draw on specialisedknowledge. Typically these tasks require readers todeal with concepts that are contrary toexpectations, and to draw on a deep understandingof long or complex texts.

• Hypothesises about an unexpected phenomenon: that anaid agency gives relatively low levels of support to avery poor country. Takes account of inconspicuous aswell as more obvious information in a complex text ona relatively unfamiliar topic (foreign aid) [PlanInternational 4B R099Q4B, score category 2]

• Evaluates the appropriateness of an apparentlycontradictory section of a notice about an immunisationprogram in the workplace, taking into account thepersuasive intent of the text and/or its logical coherence.[Flu 5 R077Q05]

4 Tasks at this level typically require readers tocritically evaluate a text, or hypothesise aboutinformation in the text, using formal or publicknowledge. Readers must demonstrate an accurateunderstanding of the text, which may be long orcomplex.

• Evaluates the writer’s craft by comparing two shortletters on the topic of graffiti. Readers need to draw ontheir understanding of what constitutes good style inwriting. [Graffiti 6B R081Q06B]

3 Tasks at this level may require connections,comparisons and explanations, or they may requirethe reader to evaluate a feature of the text. Sometasks require readers to demonstrate a detailedunderstanding of the text in relation to familiar,everyday knowledge. Other tasks do not requiredetailed text comprehension but require the readerto draw on less common knowledge. The readermay need to infer the factors to be considered.

• Connects his/her own concepts of compassion andcruelty with the behaviour of a character in a narrative,and identifies relevant evidence of such behaviour in thetext. [Gift 9 R119Q09, score category 2]

•Evaluates the form of a text in relation to its purpose.The text is a tree diagram showing the structure of thelabour force. [Labour 7 R088Q07]

1 Tasks at this level require readers to make a simpleconnection between information in the text andcommon, everyday knowledge. The reader isexplicitly directed to consider relevant factors inthe task and in the text.

• Makes a connection by articulating the relationshipbetween two parts of a single, specified sentence in amagazine article for young people about sports shoes.[Runners 6 R110Q06]


Figure 34: The Reflection and Evaluation Sub-Scale

Combined Reading Literacy Scale

The PISA 2000 combined reading literacy scaleis described in Figure 35.


5 The reader must: sequence or combine several pieces of deeply embedded information, possiblydrawing on information from outside the main body of the text; construe the meaning of linguisticnuances in a section of text; or make evaluative judgements or hypotheses, drawing on specialisedknowledge. The reader is generally required to demonstrate a full, detailed understanding of adense, complex or unfamiliar text, in content or form, or one that involves concepts that arecontrary to expectations. The reader will often have to make inferences to determine whichinformation in the text is relevant, and to deal with prominent or extensive competing information.

4 The reader must: locate, sequence or combine several pieces of embedded information; infer the meaningof a section of text by considering the text as a whole; understand and apply categories in an unfamiliarcontext; or hypothesise about or critically evaluate a text, using formal or public knowledge. The readermust draw on an accurate understanding of long or complex texts in which competing information maytake the form of ideas that are ambiguous, contrary to expectation, or negatively worded.

3 The reader must: recognise the links between pieces of information that have to meet multiplecriteria; integrate several parts of a text to identify a main idea, understand a relationship orconstrue the meaning of a word or phrase; make connections and comparisons; or explain orevaluate a textual feature. The reader must take into account many features when comparing,contrasting or categorising. Often the required information is not prominent but implicit in the textor obscured by similar information.

2 The reader must: locate one or more pieces of information that may be needed to meet multiplecriteria; identify the main idea, understand relationships or construe meaning within a limited partof the text by making low-level inferences; form or apply simple categories to explain something ina text by drawing on personal experience and attitudes; or make connections or comparisonsbetween the text and everyday outside knowledge. The reader must often deal with competinginformation.

1 The reader must: locate one or more independent pieces of explicitly stated information accordingto a single criterion; identify the main theme or author’s purpose in a text about a familiar topic; ormake a simple connection between information in the text and common, everyday knowledge.Typically, the requisite information is prominent and there is little, if any, competing information.The reader is explicitly directed to consider relevant factors in the task and in the text.

Below There is insufficient information to describe features of tasks at this level.Level 1

Figure 35: The Combined Reading Literacy Scale

Cut-off Points for Reading

The definition of proficiency levels for reading isbased on a bandwidth of 0.8 logits, and an RPlevel of 0.62 (see ‘Defining Proficiency Levels’earlier in this chapter).

Using these criteria, Figure 36 gives the levelboundaries for the three reading aspect scalesand the combined reading literacy scale, basedon the PISA scale with a mean of 500 and astandard deviation of 100 (see Chapter 13).

Boundary Cut-Off Point on PISA Scale

Level 4 / Level 5 625.6




Below Level 1 / Level 1 334.8

Figure 36: Combined Reading Literacy Scale LevelCut-Off Points P

rofi

cien

cy s

cale

s co

nst

ruct

ion

207

Mathematical Literacy

What is Being Assessed?

PISA’s mathematical literacy tasks assessstudents’ ability to recognise and interpretproblems encountered in their world; to translatethe problems into a mathematical context; to usemathematical knowledge and procedures fromvarious areas to solve a problem within itsmathematical context; to interpret results interms of the original problem; and to reflectupon methods applied and communicate theoutcomes.

In the PISA mathematical literacy scale, thekey elements defining the increasing difficulty oftasks at successive levels are the number andcomplexity of processing or computation steps;connection and integration demands (from useof separate elements, to integration of differentpieces of information, different representations,or different mathematical tools or knowledge);and representation, modelling, interpretation andreflection demands (from recognising and usinga familiar, well-formulated model, toformulating, translating or creating a usefulmodel within an unfamiliar context, and usinginsight, reasoning, argumentation andgeneralisation).

The factors that influence item difficulty, andtherefore contribute to a definition ofmathematical proficiency, include the following:• the kind and degree of interpretation and

reflection required, which includes the natureof demands arising from the context of theproblem; the extent to which themathematical demands of the problem areapparent or to which students must imposetheir own mathematical construction; and theextent to which insight, complex reasoningand generalisation are required; and

• the kind and level of mathematical skillrequired, which ranges from single-stepproblems requiring students to reproducebasic mathematical facts and simplecomputation processes to multi-step problemsinvolving more advanced mathematicalknowledge; and complex decision-making,information processing, and problem solvingand modelling skills.A single scale was proposed for PISA

mathematical literacy. Five proficiency levels canbe defined although an initial set of descriptions

of three levels lowest, middle, andhighest was developed for use in PISA 2000.

At the lowest level of proficiency that isdescribed, students typically negotiate single-stepprocesses that involve recognising familiarcontexts and mathematically well-formulatedproblems, reproducing well-known mathematicalfacts or processes, and applying simplecomputational skills.

At higher levels of proficiency, studentstypically carry out more complex tasks involvingmore than a single processing step, and combinedifferent pieces of information or interpretdifferent representations of mathematicalconcepts or information, recognising whichelements are relevant and important. To identifysolutions, they typically work with givenmathematical models or formulations, which arefrequently in algebraic form, or carry out a smallsequence of processing or a number ofcalculation steps to produce a solution.

At the highest level of proficiency that isdescribed, students take a more creative andactive role in their approach to solvingmathematical problems. They typically interpretmore complex information and negotiate anumber of processing steps. They typicallyproduce a formulation of a problem and oftendevelop a suitable model that facilitates solvingthe problem. Students at this level typicallyidentify and apply relevant tools and knowledgefrequently in an unfamiliar context, demonstrateinsight to identify a suitable solution strategy,and display other higher order cognitiveprocesses such as generalisation, reasoning andargumentation to explain or communicateresults.

Given that the amount of data available fromthe PISA 2000 study is limited becausemathematics was a minor test domain, it wasdecided that it would be unwise to establish cut-off points between five levels.

The PISA 2000 mathematical literacy scale isdescribed in Figure 37.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

208

209

Figure 37: The Mathematical Literacy Scale

400

500

600

700

800

300

Highest Level Described

• Show insight in the solution of problems;• Develop a mathematical interpretation and formulation of problems set in

a real-world context;• Identify relevant mathematical tools or methods for solution of problems

in unfamiliar contexts;• Solve problems involving several steps;• Reflect on results and generalise findings; and• Use reasoning and mathematical argument to explain solutions and

communicate outcomes.

Middle Level Described

• Interpret, link and integrate different information in order to solve aproblem;

• Work with and connect different mathematical representations of aproblem;

• Use and manipulate given mathematical models to solve a problem;• Use symbolic language to solve problems; and• Solve problems involving a small number of steps.

Lowest Level Described

• Recognise familiar elements in a problem, and recall knowledge relevantto the problem;

• Reproduce known facts or procedures to solve a problem;• Apply mathematical knowledge to solve problems that are simply

expressed and are either already formulated in mathematical terms, orwhere the mathematical formulation is straightforward; and

• Solve problems involving only one or two steps.P

rofi

cien

cy s

cale

s co

nst

ruct

ion

PIS

A 2

00

0 t

ech

nic

al r

epor

t

210

Illustrating the PISA MathematicalLiteracy Scale

The items from two different units (Apples andSpeed of Racing Car) have been selected toillustrate the PISA mathematical literacy scale.Figure 38 shows where each item on the scale islocated, and summarises the item to show how itillustrates the scale.

Item PISA Scale Comment

Apples 732 Students are given a hypothetical scenario involving planting an orchard ofM136Q03 apple trees in a square pattern, with a ‘row’ of protective conifer trees

around the square. This item requires students to show insight intomathematical functions by comparing the growth of a linear function withthat of a quadratic function. Students are required to construct a verbaldescription of a generalised pattern, and to create an argument usingalgebra. Students need to understand both the algebraic expressions used todescribe the pattern and the underlying functional relationships in such away that they can see and explain the generalisation of these relationshipsin an unfamiliar context. A chain of reasoning and communication of thisreasoning in a written explanation is required.

Apples 664 Students are given two algebraic expressions that describe the growth in M136Q02 the number of trees (apple trees and the protective conifers) as the orchard

increases in size, and are asked to find a value for which the twoexpressions coincide. This item requires students to interpret expressionscontaining words and symbols—to link different representations (pictorial,verbal and algebraic) for two relationships (one quadratic and one linear).Students have to find a strategy for determining when the two functionswill have the same solution (for example, by trial and error, or by algebraicmeans); and to communicate the result by explaining the reasoning andcalculation steps involved.

Racing Car 664 Students are given a graph showing the speed of a racing car as it progresses M159Q05 at various distances along a race track. For this item, they are also given a

set of plans of hypothetical tracks and asked to identify which could havegiven rise to the graph. The item requires students to understand andinterpret a graphical representation of a physical relationship (speed anddistance of a car) and relate it to the physical world. Students need to linkand integrate two very different visual representations of the progress of acar around a race track and to identify and select the correct option fromamong challenging alternatives.

Apples 557 Students are given a hypothetical scenario involving planting an orchard of M136Q01 apple trees in a square pattern, with a ‘row’ of protective conifer trees

around the square. They are asked to complete a table of values generatedby the functions that describe the number of trees as the size of the orchardis increased. This item requires students to interpret a written descriptionof a problem situation, to link this to a tabular representation of some ofthe information, and to recognise and extend a pattern. Students need towork with given models in order to relate two different representations(pictorial and tabular) of two relationships (one quadratic and one linear)to extend the pattern.

Pro

fici

ency

sca

les

con

stru

ctio

n

211

Scientific Literacy

What is Being Assessed?

PISA’s scientific literacy tasks assess students’ability to use scientific knowledge(understanding scientific concepts); to recognisescientific questions and to identify what isinvolved in scientific investigations(understanding the nature of scientificinvestigation); to relate scientific data to claimsand conclusions (using scientific evidence); andto communicate these aspects of science.

In the PISA scientific literacy scale, the keyelements defining the increasing difficulty oftasks at successive levels are the complexity ofthe concepts encountered, the amount ofinformation given, the type of reasoningrequired, and the context and sometimes theformat and presentation of the items. There is aprogression of difficulty across tasks in the useof scientific knowledge involving:• the recall of simple scientific knowledge,

common scientific knowledge or givenquestions, variables or data;

• scientific concepts or questions and details ofinvestigations; and

• elaborated scientific concepts or additionalinformation or a chain of reasoning; throughto simple scientific conceptual models oranalysis of investigations or evidence foralternative perspectives.The scientific proficiency scale is based on a

definition of scientific literacy linked to fourprocesses, as formulated in the PISA scienceframework (OECD, 1999a):• Understanding scientific concepts concerns the

ability to make use of scientific knowledgeand to demonstrate an understanding ofscientific concepts by applying scientific ideas,information, or appropriate concepts (notgiven in the stimulus or test item) to a givensituation. This may involve explainingrelationships, scientific events or phenomenaor possible causes for the changes that areindicated, or making predictions about theeffects of given changes or identifying thefactors that would influence a given outcome.

tem PISA Scale Comment

Racing Car 502 Students are given a graph showing the speed of a racing car as it M159Q01 progresses at various distances along a race track. For this item they are

asked to interpret the graph to find a distance that satisfies a givencondition. The item requires students to interpret a graphicalrepresentation of a physical relationship (distance and speed of a cartravelling on a track of unknown shape). Students need to interpret thegraph by linking a verbal description with two particular features of thegraph (one simple and straightforward, and one requiring a deeperunderstanding of several elements of the graph and what it represents),then identify and read the required information from the graph, selectingthe best option from the given alternatives.

Racing Car 423 In this item, students are asked to interpret the speed of the car at a M159Q03 particular point in the graph. The item requires students to read

information from a graph representing a physical relationship (speed anddistance of a car). Students need to identify the place in the graph referredto in a verbal description, recognise what is happening to the speed of thevehicle at that point, then select the best matching option from amonggiven alternatives.

Racing Car 412 This item asks students to read from the graph a single value that satisfies M159Q02 a simple condition. The item requires students to read information from a

graph representing a physical relationship (speed and distance of a car).Students need to identify one specified feature of the graph (the display ofspeed), read directly from the graph a value that minimises the feature,then select the best match from among given alternatives.

Figure 38: Typical Mathematical Literacy Tasks

Figure 39: The Scientific Literacy Scale

400

500

600

700

800

300

Highest Level Described

• Create or use simple conceptual models to make predictions or giveexplanations;

• Analyse scientific investigations in relation to, for example, experimentaldesign, identification of idea being tested;

• Relate data as evidence to evaluate alternative viewpoints or differentperspectives; and

• Communicate scientific arguments and/or descriptions in detail and withprecision.

Middle Level Described

• Use scientific concepts in making predictions or giving explanations;• Recognise questions that can be answered by scientific investigation

and/or identify details of what is involved in a scientific investigation; and• Select relevant information from competing data or chains of reasoning

in drawing or evaluating conclusions.

Lowest Level Described

• Recall simple scientific factual knowledge (e.g., names, facts,terminology, simple rules); and

• Use common knowledge in drawing or evaluating conclusions.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

212

• Understanding the nature of scientificinvestigation concerns the ability to recognisequestions that can be scientifically investigatedand to be aware of what is involved in suchinvestigations. This may involve recognisingquestions that could be or are being answeredin an investigation, identifying the question oridea that was being (or could have been) testedin a given investigation, distinguishing questionsthat can and cannot be answered by scientificinvestigation, or suggesting a question that couldbe investigated scientifically in a given situation.It may also involve identifying evidence/dataneeded to test an explanation or explore anissue, requiring the identification or recognitionof variables that should be compared, changedor controlled, additional information thatwould be needed, or action that should betaken so that relevant data can be collected.

• Using scientific evidence concerns the abilityto make sense of scientific data as evidencefor claims or conclusions. This may involveproducing a conclusion from given scientificevidence/data or selecting the conclusion thatfits the data from alternatives. It may alsoinvolve giving reasons for or against a givenconclusion in terms of the data provided, oridentifying the assumptions made in reachinga conclusion.

• Communicating scientific descriptions orarguments concerns the ability tocommunicate scientific descriptions,arguments and explanations to others. Thisinvolves communicating valid conclusionsfrom available evidence/data to a specifiedaudience. It may involve producing anargument or an explanation based on thesituation and data given, or providing relevantadditional information, expressedappropriately and clearly for the audience.Developing the proficiency scale for science

involved an iterative process that considered thetheoretical basis of the science framework andthe item analyses from the main study.

Consideration of the framework suggested ascale of scientific literacy which, at its lowerlevels, would involve being able to grasp easierscientific concepts in familiar situations such asrecognising questions that can or cannot bedecided by scientific investigation, recallingscientific facts, or using common scientificknowledge to draw conclusions.

Students at higher levels of proficiency wouldbe able to apply more cognitively demandingconcepts in more complex and occasionallyunfamiliar contexts. This could involve creatingmodels to make predictions, using data asevidence to evaluate different perspectives orcommunicating scientific arguments withprecision.

The difficulty levels indicated by the analysisof the science items in PISA 2000 also gaveuseful information, which was considered alongwith the theoretical framework in developingscale descriptions.

Only limited data were available from thePISA 2000 study, where science was a minor testdomain. For this reason, a strict definition oflevels could not be confidently recommended.The precise definition of levels and investigationof sub-scales will be possible for PISA 2006,using the data from science as a major testdomain.

The PISA 2000 scientific literacy scale isdescribed in Figure 39.

Illustrating the PISA ScientificLiteracy Scale

Several science items have been recommendedfor public release on the basis of the followingcriteria. In addition to indicating the assessmentstyle used in PISA, these items provide a guide tothe descriptive scientific proficiency scales. Theselected items: • exemplify the PISA philosophy of assessing

the students’ preparedness for future liferather than simply testing curriculum-basedknowledge;

• cover a range of proficiency levels;• give consistent, reliable item analyses and

were not flagged as causing problems inindividual countries;

• represent several formats: multiple-choice,short constructed-response, open response;and

• include some partial credit items.The Semmelweis and Ozone units, containing

a total of 10 score points, were released afterPISA 2000. Characteristics of these units aresummarised in Figures 40 and 41.

Pro

fici

ency

sca

les

con

stru

ctio

n

213

PIS

A 2

00

0 t

ech

nic

al r

epor

t

214

SemmelweisThe historical example about Semmelweis’discoveries concerning puerperal fever waschosen for several reasons:• It illustrates the importance of systematic data

collection for explaining a devastatingphenomenon, the death of many women inmaternity wards;

• It demonstrates the need to consideralternative explanations based on the datacollected and on additional observationsregarding the behaviour of the medicalpersonnel;

• It shows how scientific reasoning based onevidence can refute popular beliefs thatprevent a problem from being addressedeffectively; and

• It allows students to expand on the exampleby bringing in their own scientific knowledge.

collected to defend the idea that earthquakes arean unlikely cause of the disease. The graphsshow a similar variation in death rate over time,and the first ward consistently has a higherdeath rate than the second ward. If earthquakeswere the cause, death rates in both wards shouldbe about equal. The graphs suggest thatsomething having to do with the wardsthemselves would explain the difference.

To obtain full credit, students needed to referto the idea that death rates in both wards shouldhave been similar over time if earthquakes werethe cause. Some students came up with answersthat did not refer to Semmelweis’ findings, butto a characteristic of earthquakes that made itvery unlikely that they were the cause, such asthat they are infrequent while the fever isconstantly present. Others came up with veryoriginal well-taken statements such as ‘if it wereearthquakes, why do only women get the

disease, and not men?’or ‘ if so, womenoutside the wardswould also get thatfever’. Although it couldbe argued that thesestudents did notconsider the datacollected by Semmelweis,as the item asks, theywere given partial creditbecause theydemonstrated a certainability to use scientificfacts in drawing aconclusion.

This partial credit item is a good example ofhow one item can be used to exemplify twoparts of the proficiency scale. The first scorepoint is a moderately difficult item because itinvolves using scientific evidence to relate datasystematically to possible conclusions using achain of reasoning that is not given to thestudents. The second score point is at a higherlevel of proficiency because the students have torelate the data given as evidence to evaluatedifferent perspectives.

Sample Item S195Q04 (multiple-choice)

Semmelweis gets an idea from his observations.Sample Item S195Q04 asks students to identifywhich from a set of ideas was most relevant to

Unit Name Unit Q# Score Location ScientificIdentification on PISA Scale Process

Semmelweis 195 2 2 679 c

Semmelweis 195 2 1 651 c

Semmelweis 195 4 1 506 b

Semmelweis 195 5 1 480 a

Semmelweis 195 6 1 521 a

Note: The a, b, and c in the right-hand column refer to the scientific processes: a = understanding scientific concepts; b = understanding the nature of scientificinvestigation; and c = using scientific evidence.

Figure 40: Characteristics of the Science Unit, Semmelweis

Sample Item S195Q02 (open constructed-response)

Sample Item S195Q02 is the first in a unit thatintroduces students to Semmelweis, who took upa position in a Viennese hospital in the 1840sand became puzzled by the remarkably highdeath rate due to puerperal fever in one of thematernity wards. This information is presentedin text and graphs, and the suggestion is thenmade that puerperal fever may be caused byextraterrestrial influences or natural disasters—not uncommon thinking in Semmelweis’ time.Semmelweis tried on several occasions toconvince his colleagues to consider more rationalexplanations. Students are invited to imaginethemselves in his position, and to use the data he

reducing the incidence of puerperal fever.Students need to put two pieces of relevantinformation from the text together: thebehaviour of a medical student and the death ofSemmelweis’ friend from puerperal fever afterdissection. This item exemplifies a low tomoderate level of proficiency because it asksstudents to refer to given data or information todraw their conclusion.


Nowadays, most people are well aware thatgerms cause many diseases, and can be killed byheat. Perhaps not many people realise thatroutine procedures in hospitals make use of thisto reduce risks of fever outbreak.

This item invites the student to apply thecommon scientific knowledge that heat killsbacteria to explain why these procedures areeffective. This item is an example of a level oflow to moderate difficulty.


Sample Item S195Q06 goes beyond thehistorical example to elicit common scientificknowledge needed to explain a scientificphenomenon. Students are asked to explain whyantibiotics have become less effective over time.To respond correctly, they must know thatfrequent, extended use of antibiotics builds upstrains of bacteria resistant to the antibiotics’initially lethal effects.

This item is located at a moderate level on theproficiency scale because it asks students to usescientific concepts (as opposed to commonscientific knowledge, which is at a lower level) togive explanations.

OzoneThe subject of the Ozone unit is the effect on theenvironment of the depletion of the ozone layer,and is very important as it concerns humansurvival. It has received much media attention.The unit is significant too, because it points tothe successful intervention by the internationalcommunity at a meeting in Montreal an earlyexample of successful global consultation onmatters that extend beyond national boundaries.

The text presented to the students sets thecontext by explaining how ozone molecules andthe ozone layer are formed, and how the layerforms a protective barrier by absorbing ultra-violet radiation from the sun’s rays. Both goodand bad effects of ozone are described. Thecontext of the items involves genuine, importantissues, and gives the opportunity to askquestions about physical, chemical andbiological aspects of science.

Sample Item S253Q01 (open constructed-response)In the first Ozone sample item, the formation ofozone molecules is presented in a novel way. Acomic strip shows three stages of oxygenmolecules being split under the sun’s influenceand then recombined into ozone molecules.Students need to interpret this information andcommunicate it to a person with limitedscientific knowledge. The intention is to assessstudents’ ability to communicate their owninterpretation, which requires an open-endedresponse format.

Many different responses can be given, but togain full credit, students must describe what ishappening in at least two of the three stages ofthe comic strip.

Unit Name Unit Q# Score Location ScientificIdentification on PISA Scale Process

Ozone 253 1 2 695 d

Ozone 253 1 1 641 d

Ozone 253 2 1 655 c

Ozone 253 5 1 560 a

Ozone 270 3 1 542 b

Note: The a, b, c, d in the right-hand column refer to the scientific processes:a = understanding scientific concepts; b = understanding the nature of scientificinvestigation; c = using scientific evidence; and d = communicating scientificdescriptions or arguments.

Figure 41: Characteristics of the Science Unit, Ozone Pro

fici

ency

sca

les

con

stru

ctio

n

215

Students who received partial credit for thisitem are regarded as working at a moderate tohigh level of proficiency because they couldcommunicate a simple scientific description. Fullcredit requires more precision and detail this isan example of the highest proficiency level.


Sample Item S253Q02 requires students to linkpart of the text to their own experience ofweather conditions (thunderstorms occurringrelatively close to Earth) in order to draw aconclusion about the nature of the ozoneproduced (‘good’ or ‘bad’). The item requiresdrawing an inference or going beyond the statedinformation.

This is an example of an item at a moderateto high level of proficiency because studentsmust relate data systematically to statements ofpossible conclusions using a chain of reasoningthat is not specifically given to them.


In Sample Item S253Q05, students need todemonstrate specific knowledge of a possibleconsequence for human health of the depletionof the ozone layer. The Marking Guide is precisein demanding a particular type of cancer (skincancer). This item is regarded as requiring amoderate level of proficiency because studentsmust use a specific scientific concept inanswering it.

Sample Item S270Q03 (complex multiple-choice)

The final Ozone sample item emphasises theinternational significance of scientific research inhelping to solve environmental problems, such asthe ones raised at the Montreal meeting. Theitem requires students to recognise questions thatcan be answered by scientific investigation and isan example of a moderate level of proficiency.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

216

Constructing andValidating theQuestionnaire indices

Wolfram Schulz

In several cases, the raw data from the contextquestionnaires were recoded or scaled toproduce new variables, referred to as indices.Indices based on direct recoding of responses toone or more variables are called simple indices.Many PISA indices, however, have beenconstructed by applying a scaling methodology,and are referred to as complex indices. Theysummarise student or school representatives’(typically principals) responses to a series ofrelated questions selected from larger constructson the basis of theoretical considerations andprevious research. Structural equation modellingwas used to confirm the theoretically expectedbehaviour of the indices and to validate theircomparability across countries. For this purpose,a model was estimated separately for eachcountry and, collectively, for all OECDcountries.1

This chapter describes the construction of allindices, both simple and complex. It presents theresults of a cross-country validation of thestudent and school indices, and providesdescriptive statistics for selected questionnairevariables. The analysis was done using StructuralEquation Modelling (SEM) for a ConfirmatoryFactor Analysis (CFA) of questionnaire items.

CFA was used to validate the indices, and itemresponse theory techniques were used to producescale scores. An index involving multiplequestions and student responses was scaled usingthe Rasch item response model, and the scalescore is a weighted maximum likelihood estimate,

1 The Netherlands was not included because of its lowparticipation rate.

referred to as a Warm estimator (Warm, 1985)(also see Chapter 9). Dichotomous responsecategories (i.e., yes/no) produce only oneparameter estimate (labelled ‘Delta’ in the figuresreporting item parameters in this chapter). Ratingscales allowing graded responses (i.e., never, somelessons, most lessons, every lesson) produce botha parameter estimate and an estimate of thethresholds between n categories labelled ‘Tau(1)’,‘Tau(2)’, etc. in the figures reporting itemparameters in this chapter.

The scaling proceeded in three steps:• Item parameters were estimated from

calibration samples of students and schools.The student calibration sample consisted ofsub-samples of 500 randomly selectedstudents, and the school calibration sampleconsisted of sub-samples of 99 schools, fromeach participating OECD country.2 Non-OECD countries did not contribute to thecomputation of the item parameters.

2 For the international options, the item responsetheory indices were transformed using the same proce-dures as those applied to the other indices, but onlyOECD countries that had participated in the internationaloptions were included. For Belgium, only the FlemishCommunity chose to include the international options,and it was given equal weight and included. For theUnited Kingdom where only Scotland participated, thedata were not included in the standardisation. Therationale was that in Belgium, the participating regionrepresented half of the population whereas students inScotland are only a small part of the student populationin the United Kingdom.

Chapter 17

• Estimates were computed for all students andall schools by anchoring the item parametersobtained in the preceding step.

• Indices were then standardised so that themean of the index value for the OECDstudent population was zero and the standarddeviation was one (countries being givenequal weight in the standardisation process).This item response theory approach was

adopted for producing index ‘scores’, aftervalidation of the indices with CFA procedures,because (i) it was consistent with the intention thatthe items simply be added together to produce asingle index, and (ii) it provided an elegant way ofdealing with missing item-response data.

Validation Procedures ofIndices

Structural Equation Modelling (SEM) was usedto confirm theoretically expected dimensions orto re-specify the dimensional structure. SEMtakes the measurement error associated with theindicators into account and therefore providesmore reliable estimates for the latent variablesthan classical psychometric methods likeexploratory factor analyses or simplecomputations of alpha reliabilities.

Basically, in CFA with SEM, an expectedcovariance matrix is fitted to a theoretical factorstructure by minimising the differences betweenan expected covariance matrix (Σ, based upon agiven model) and the observed covariance matrix(S, computed from the data).

Measures for the overall fit of a model arethen obtained by comparing the expected Σmatrix with the observed S matrix. If thedifferences are close to zero, then the model fitsthe data, while if they are rather large the modeldoes not fit the data and some re-specificationmay be necessary or, if this is not possible, thetheoretical model should be rejected. Generally,model fit indices are approximations and need tobe interpreted with care. In particular, the chi-squared test statistic for the null hypothesis ofΣ=S becomes a rather poor fit measure withlarger sample sizes because even smalldifferences between matrices are given assignificant deviations.

Though the assessment of model fit was basedon a variety of measures, this report presentsonly the following two indices to illustrate thecomparative model fits:

• The Root Mean Square Error ofApproximation (RMSEA) measures thediscrepancy per degree of freedom for themodel (Browne and Cudeck, 1993). Values of0.05 or less indicate a close fit, values of 0.08or more indicate a reasonable to a large errorof approximation and values greater than 1.0should lead to the rejection of a model.

• The Adjusted Goodness-of-Fit-Index (AGFI)indicates the amount of variance in Sexplained with Σ. Values close to 1.0 indicatea good model fit.

• The difference between a specified model anda null model is measured by the Normed FitIndex (NFI), Non-Normed Fit Index (NNFI)and Comparative Fit Index (CFI). Like AGFI,values close to 1.0 for these three indicesshow a good model fit.Additionally, the explained variance (or item

reliability) in the manifest variables was taken asan additional indicator of model fit i.e., if thelatent variables hardly explain any variance inthe manifest variables, the assumed factorstructure is not confirmed even if the overallmodel fit is reasonable.

Model estimation was done with LISREL(Jöreskog and Sörbom, 1993) using theSTREAMS (Gustafsson and Stahl, 2000)interface program which gives the researcher thepossibility of estimating models for groups ofdata (e.g., different country data sets) andcomparing the model fits across countrieswithout needing to re-estimate the same modelcountry-by-country. Model estimations wereobtained using variance/covariance matrices.3

For each of the indices, the assumed modelwas estimated separately for each country andfor the entire sample. In a few cases, the modelhad to be respecified; in some cases, where thiswas unsuccessful, only the results for themisfitting model are reported.

For the student data, the cross-countryvalidation was based on the calibration sample.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

218

3 For ordinal variables, Jöreskog and Sörbom (1993)recommend using Weighted Least Square Estimation(WLS) with polychoric correlation matrices and itscorresponding asymptotic covariance weight matricesinstead of ML estimation with covariance matrices. Asthis procedure becomes computationally difficult withlarge amounts of data, the use of covariance matriceswas chosen for the analysis of dimensionality reportedhere.

Here, the sample size was deemed appropriatefor this kind of evaluation. For the school data,the complete school data set for each countrywas taken, given the small number of schoolsper country in the calibration sample.

Indices

In the remainder of the chapter, simple indexvariables (those based on direct recoding ofresponses to one or more variables) are describedfirst, followed by complex indices (those that havebeen constructed by applying IRT scalingmethodology), first for students and second, forschools. Item parameter estimates are providedfor each of the scaled indices.

Student Characteristics andFamily Background

Student ageThe variable AGE gives the age of the studentexpressed in number of months and wascomputed as the difference between the date ofbirth and the middle date of the country’s testingwindow.

Family structureStudents were asked to indicate who usually livesat home with them. The variable FAMSTRUC isbased on the first four items of this question(ST04Q01, ST04Q02, ST04Q03 and ST04Q04).It takes four values and indicates the nature of thestudent’s family: (i) single-parent family (studentswho reported living with one of the following:mother, father, female guardian or male guardian);(ii) nuclear family (students who reported livingwith a mother and a father); (iii) mixed family(students who reported living with a mother anda male guardian, a father and a female guardian,or two guardians); and (iv) other responsecombinations.

Number of siblingsStudents were asked to indicate the number ofsiblings older and younger than themselves, orthe same age. NSIB, the total number of siblings,is derived from the three items of questionST25Q01.

Country of birthStudents were asked if they, their mother, and

their father were born in the country ofassessment or in another country. Responseswere then grouped into: (i) students with native-born parents (students born in the country of theassessment, with at least one parent born in thatcountry); (ii) first-generation students (studentsborn in the country of the assessment but whoseparents were both born in another country); and(iii) non-native students (students born outsidethe country of the assessment and whose parentswere also born in another country). The variableis based on recoding ST16Q01 (student),ST16Q02 (mother) and ST16Q03 (father).

Language spoken at homeStudents were asked if the language spoken athome most of the time was the language ofassessment, another official national language,another national dialect or language, or anotherlanguage. The responses (to ST17Q01) werethen grouped into two categories: (i) languagespoken at home most of the time is differentfrom the language of assessment, from otherofficial national languages and from othernational dialects or languages; and (ii) thelanguage spoken at home most of the time is thelanguage of assessment, is another officialnational language, or other national dialect orlanguage.

Birth orderThe birth order of the assessed student,BRTHORD, is derived from the variablesST05Q01 (number of older siblings), ST05Q02(number of younger siblings) and ST05Q03(number of siblings of equal age). It is equal to 0if the student is the only child, 1 if the student isthe youngest child, 2 if the student is a middlechild, and 3 if the student is the oldest child.

Socio-economic statusStudents were asked to report their mother’s andfather’s occupation, and to state whether eachparent was: in full-time paid work; in part-timepaid work; not working but looking for a paidjob; or other. The open-ended responses werethen coded in accordance with the InternationalStandard Classification of Occupations (ISCO1988).

The PISA International Socio-Economic Indexof Occupational Status (ISEI) was derived fromstudent responses on parental occupation. The

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

219

index capture the attributes of occupations thatconvert parents’ education into income, andwere derived by the optimal scaling ofoccupation groups to maximise the indirecteffect of education on income by occupation,and to minimise the direct effect of education onincome net of occupation (both effects are net ofage). For more information on the methodology,see Ganzeboom, de Graaf and Treiman (1992).

Data for mother’s occupation, father’soccupation and the student’s expected occupationat the age of 30, obtained through questionsST08Q01, ST09Q01, ST10Q01, ST11Q01 andST40Q01, were transformed to SEI indices usingthe above procedures. The variables BMMJ,BFMJ and BTHR are the mother’s SEI, father’sSEI and student’s self-expected SEI, respectively.Two new derived variables, ISEI and HISEI, werethen computed from BMMJ and BFMJ. If BFMJis available, ISEI is equal to BFMJ, while if it isnot available but BMMJ is available, then ISEI isequal to BMMJ. HISEI corresponds to the highervalue of BMMJ and BFMJ.4

Parental educationStudents were asked to classify their mother’s andfather’s highest level of education on the basis ofnational qualifications that were then codedaccording to the International StandardClassification of Education (ISCED, OECD, 1999b)to obtain internationally comparable categoriesof educational attainment. The resultingcategories were: did not go to school; completed<ISCED Level 1 only (primary education)>;completed <ISCED Level 2 only (lowersecondary education)>; completed <ISCED Level3B or 3C only (upper secondary education,

PIS

A 2

00

0 t

ech

nic

al r

epor

t

220

4 To capture wider aspects of a student’s family and home background in addition to occupational status, the PISA initial reportuses an index of economic, social and cultural status that is not in the international database. This index was created on the basisof the following variables: the International Socio-Economic Index of Occupational Status (ISEI); the highest level of education ofthe student’s parents converted into years of schooling; the PISA index of family wealth; the PISA index of home educationalresources; and the PISA index of possessions related to ‘classical culture’ in the family home. The ISEI represents the first principalcomponent of the factors described above. The index was constructed to have a mean of 0 and a standard deviation of 1.

Among these components, the most commonly missing data relate to the International Socio-Economic Index of OccupationalStatus (ISEI), parental education, or both. Separate factor analyses were therefore undertaken for all students with valid data for: i) socio-economic index of occupational status, index of family wealth, index of home educational resources and index ofpossessions related to ‘classical culture’ in the family home; ii) years of parental education, index of family wealth, the index ofhome educational resources and index of possessions related to ‘classical culture’ in the family home; and iii) index of familywealth, index of home educational resources and index of possessions related to ‘classical culture’ in the family home. Studentswere then assigned a factor score based on the amount of data available. For this to be done, students had to have data on at leastthree variables.5 Terms in brackets < > were replaced in the national versions of the Student and School questionnaires by the appropriate nationalequivalent. For example, <qualification at ISCED level 5A> was translated in the United States into ‘Bachelor’s Degree, post-graduate certificate program, Master’s degree program or first professional degree program’. Similarly, <classes in the language ofassessment> in Luxembourg was translated into ‘German classes’ or ‘French classes’ depending on whether students received theGerman or French version of the assessment instruments.

aimed in most countries at providing direct entryinto the labour market)>; completed <ISCEDLevel 3A (upper secondary education, aimed inmost countries at gaining entry into tertiaryeducation)>; and completed <ISCED Level 5A,5B or 6 (tertiary education)>.5

The educational level of the father and themother were collected through two questions foreach parent (ST12Q01 and ST14Q01 for themother and ST13Q01 and ST15Q01 for thefather). The variables in the database formother’s and father’s highest level of educationare labelled MISCED and FISCED.

Validating socio-economic status and parentaleducation

In Canada, the Czech Republic, France and theUnited Kingdom, validity studies were carried outto test the reliability of student answers to thequestions on their parents’ occupations. Generallythese studies indicate that useful data on parentaloccupation can be collected from 15-year-oldstudents. Further details of the validity studiesare presented in Appendix 4.

When reviewing the validity studies it shouldbe kept in mind that 15-year-old students’ reportsof their parents’ occupations are not necessarilyless accurate than adults’ reports of their ownoccupations. Individuals have a tendency to inflatetheir own occupations, especially when the data arecollected by personal (including telephone) inter-views. The correspondence between self-reportsand proxy reports of husbands and wives is alsoimperfect. In addition, official records of a parent’soccupation are arguably subject to greatersources of error (job changes, changes in codingschemes, etc.) than the answers of 15-year-olds.

The important issue here is the highcorrelation between parental occupation statusderived from student and self-reports rather thanthe exact correspondence between ISCOoccupational categories. For example, students’and parents’ reports of a parent’s occupationmay result in it being coded to different ISCOcategories but in very similar SEI scores.

In the PISA validity studies carried out in thefour countries, the correlation of SEI scoresbetween parental occupation as reported bystudents and by their parents was high, between0.70 and 0.86. This correlation was judged to behigh since the test/retest correlation of SEI fromadults who were asked their occupation on twooccasions over a short time period typically liesin this range.

The correspondence between broad occupationalcategories, which averaged about 70 per cent, wasgreater for some occupational groups (e.g.professionals) than others (e.g., managers andadministrators). Only 5 per cent of cases showeda large lack of correspondence, for example,between professionals and unskilled labourers.

Finally, the strength of the relationshipbetween SEI and achievement was found to bevery similar using SEI scores derived either fromthe students’ or from parents’ reports.

The desirability of coding to 4-digit ISCOcodes versus 1 or 2-digit ISCO codes was alsoexplored. SEI scores generated from one-digitISCO codes (that is, only nine major groups)proved to substantially change the strength ofthe relationship between occupational status andachievement, usually reducing it but in someinstances increasing it. The results of coding toonly 2-digit ISCO codes were less clear. For therelationship between father’s SEI andachievement, the correlations between SEI scoresconstructed from 2 or 4-digit ISCO codes wereonly slightly different. However with mother’sSEI, sizeable differences did emerge for some

countries, with differences of up to 0.05 in thecorrelation (or a 24 per cent attenuation). It isnot immediately apparent why the correlationshould change for mothers but not for fathers,but it is likely to be due to the greater clusteringof women in particular jobs and the ability of 4-digit coding to distinguish high and low statuspositions within broad job groups. Since men’soccupations are more evenly spread, theprecision of 4-digit coding will have less of animpact.

Given the differences in the correlationbetween mother’s SEI and achievement, it waspreferable (and safer) to code the occupations to4-digit ISCO codes. Additionally, convertingalready-coded occupations to 2-digit codes from4-digit codes is quite a different exercise thandirectly coding the occupation to two digits.More detailed occupational coding is likely toincrease the coder’s familiarity and thereforeaccuracy with the ISCO coding schema.

Parental Interest and Family Relations

Cultural communication

The PISA index of cultural communication wasderived from students’ reports on the frequencywith which their parents (or guardians) engagedwith them in: discussing political or social issues;discussing books, films or television programmes;and listening to classical music. Frequency wasmeasured on a five-point scale of never or hardlyever; a few times a year; about once a month;several times a month; and several times a week.

Scale scores are standardised Warm estimates.Positive values indicate a higher frequency ofcultural communication and negative valuesindicate a lower frequency of culturalcommunication. The item parameter estimatesused for the weighted likelihood estimation aregiven in Figure 42. For the meaning of delta andtau, see the third paragraph of this chapter.

221

In general, how often do your parents: Parameter Estimates

Delta Tau(1) Tau(2) Tau(3) Tau(4)

ST19Q01 discuss political or social issues with you? -0.05 -0.59 0.43 -0.40 0.56

ST19Q02 discuss books, films or television programmes with you? -0.68 -0.52 0.26 -0.32 0.58

ST19Q03 listen to classical music with you? 0.73 0.71 0.10 -0.60 -0.22

Figure 42: Item Parameter Estimates for the Index of Cultural Communication (CULTCOM)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

Social communicationThe PISA index of social communication wasderived from students’ reports on the frequencywith which their parents (or guardians) engagedwith them in the following activities: discussinghow well they are doing at school; eating <themain meal> with them around a table; andspending time simply talking with them.Frequency was measured on a five-point scale ofnever or hardly ever; a few times a year; aboutonce a month; several times a month; andseveral times a week.

Scale scores are standardised Warm estimates,where positive values indicate higher frequencyof social communication and negative valuesindicate lower frequency of social communication.The item parameters used for the weightedlikelihood estimation are given in Figure 43.

Family educational supportThe PISA index of family educational supportwas derived from students’ reports on howfrequently the mother, father, or brothers andsisters worked with the student on what isregarded nationally as schoolwork. Studentsresponded to each statement on a five-pointscale with the following: never or hardly ever, a

few times a year, about once a month, severaltimes a month and several times a week.

Scale scores are standardised Warm estimates,where positive values indicate higher frequencyof family (parents and siblings) support for thestudent’s schoolwork while negative valuesindicate lower frequency. Item parameters usedfor the weighted likelihood estimation are givenin Figure 44.

Cultural activitiesThe PISA index of activities related to classicalculture was derived from students’ reports on howoften they had, during the preceding year: visited amuseum or art gallery; attended an opera, ballet orclassical symphony concert; or watched livetheatre. Students responded to each statement on afour-point scale with: never or hardly ever, once ortwice a year, about three or four times a year, andmore than four times a year.

Scale scores are standardised Warm estimateswhere positive values indicate higher frequencyand negative values indicate lower frequency ofcultural activities during the year. Itemparameters used for the weighted likelihoodestimation are given in Figure 45.

222

In general, how often do your parents: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3) Tau(4)

ST19Q04 discuss how well you are doing at school? 0.052 -0.586 -0.012 0.069 0.53

ST19Q05 eat <the main meal> with you around a table? -0.165 0.521 0.605 -0.504 -0.62

ST19Q06 spend time just talking to you? 0.113 0.011 0.143 -0.289 0.14

Figure 43: Item Parameter Estimates for the Index of Social Communication (SOCCOM)

How often do the following people work Parameter Estimateswith you on your <schoolwork>? Delta Tau(1) Tau(2) Tau(3) Tau(4)

ST20Q01 Your mother -0.24 -0.24 -0.07 -0.40 0.71

ST20Q02 Your father 0.01 -0.24 -0.13 -0.40 0.77

ST20Q03 Your brothers and sisters 0.24 0.22 -0.23 -0.38 0.39

Figure 44: Item Parameter Estimates for the Index of Family Educational Support (FAMEDSUP)

During the past year, how often have you Parameter Estimatesparticipated in these activities? Delta Tau(1) Tau(2) Tau(3)

ST18Q02 Visited a museum or art gallery. -0.48 -1.54 0.77 0.77

ST18Q04 Attended an opera, ballet or classical symphony concert. 0.76 -0.46 0.52 -0.05

ST18Q05 Watched live theatre. -0.28 -1.42 0.79 0.63

Figure 45: Item Parameter Estimates for the Index of Cultural Activities (CULTACTV)

PIS

A 2

00

0 t

ech

nic

al r

epor

t

Item dimensionality and reliability of‘parental interest and family relations’ scales

Structural Equation Modelling confirmed theexpected dimensional structure of the items usedfor the parental interest and family relationsindices. All fit measures indicated a good model fitfor the international sample (RMSEA = 0.045,AGFI = 0.97, NNFI = 0.93, and CFI = 0.95), andthe estimated correlation between latent factorswas highest between CULTCOM and SOCCOM,

with 0.69, and lowest for CULTACTV andFAMEDSUP, with 0.17. For the country sub-samples, the RMSEA ranged between 0.032 and0.071, showing that the dimensional structurewas also confirmed across countries.

Table 74 shows the reliability for each of theparental interest and family relations variables foreach participating country. Internal consistency isgenerally rather moderate but still satisfactorygiven the low number of items for each scale.

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

223

Table 74: Reliability of Parental Interest and Family Relations Scales

Country CULTCOM SOCCOM FAMEDSUP CULTACTV

Australia 0.62 0.58 0.66 0.60Austria 0.55 0.53 0.70 0.66Belgium 0.51 0.56 0.66 0.57Canada 0.58 0.57 0.68 0.63Czech Republic 0.53 0.57 0.63 0.64Denmark 0.57 0.64 0.56 0.60Finland 0.52 0.46 0.64 0.62France 0.49 0.53 0.58 0.55Germany 0.55 0.50 0.64 0.64Greece 0.44 0.62 0.70 0.52Hungary 0.50 0.58 0.65 0.66Iceland 0.58 0.54 0.65 0.59Ireland 0.53 0.56 0.69 0.54Italy 0.49 0.55 0.59 0.57Japan 0.59 0.59 0.71 0.50Korea 0.63 0.67 0.74 0.67Luxembourg 0.53 0.58 0.61 0.68Mexico 0.50 0.72 0.66 0.62New Zealand 0.57 0.56 0.69 0.59Norway 0.60 0.65 0.69 0.60Poland 0.57 0.63 0.66 0.68Portugal 0.52 0.62 0.68 0.55Spain 0.50 0.61 0.63 0.56Sweden 0.58 0.57 0.60 0.63Switzerland 0.58 0.47 0.66 0.60United Kingdom 0.51 0.60 0.66 0.63United States 0.63 0.68 0.69 0.65Mean OECD countries 0.55 0.58 0.66 0.63Brazil 0.57 0.65 0.67 0.54Latvia 0.50 0.58 0.65 0.65Liechtenstein 0.59 0.50 0.68 0.58Russian Federation 0.49 0.62 0.67 0.68Netherlands 0.54 0.64 0.67 0.59

Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.

Family Possessions

Family wealthThe PISA index of family wealth was derivedfrom students’ reports on: (i) the availability intheir home of a dishwasher, a room of theirown, educational software, and a link to theInternet; and (ii) the numbers of cellular phones,televisions, computers, motor cars andbathrooms at home.

Scale scores are standardised Warm estimates,where positive values indicate more wealth-related possessions and negative values indicatefewer wealth-related possessions. The itemparameters used for the weighted likelihoodestimation are given in Figure 46.

Home educational resources The PISA index of home educational resourceswas derived from students’ reports on theavailability and number of the following items intheir home: a dictionary, a quiet place to study, adesk for study, text books and calculators.

Scale scores are standardised Warm estimateswhere positive values indicate possession ofmore educational resources and negative valuesindicate possession of fewer educationalresources by the student’s family. The itemparameters used for the weighted likelihoodestimation are given in Figure 47.

Possessions related to ‘classical’ culture in thefamily home

The PISA index of possessions related to‘classical’ culture in the family home was derivedfrom students’ reports on the availability of thefollowing items in their home: classical literature(examples were given), books of poetry andworks of art (examples were given).

Scale scores are standardised Warm estimates,where positive values indicate a greater numberof cultural possessions while negative valuesindicate fewer cultural possessions in the student’shome. The item parameters used for the weightedlikelihood estimation are given in Figure 48.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

224

In your home, do you have: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3)

ST21Q01 a dishwasher? 0.204

ST21Q02 a room of your own? -1.421

ST21Q03 educational software? 0.085

ST21Q04 a link to the Internet? 0.681

How many of these do you have at your home?

ST22Q01 <Cellular> phone 0.225 -0.657 0.137 0.52

ST22Q02 Television -1.332 -2.31 0.506 1.804

ST22Q04 Computer 1.133 -1.688 0.628 1.06

ST22Q06 Motor car 0.379 -1.873 0.262 1.611

ST22Q07 Bathroom 0.046 -3.364 0.973 2.391

Figure 46: Item Parameter Estimates for the Index of Family Wealth (WEALTH)

In your home, do you have: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3)

ST21Q05 a dictionary? -1.19

ST21Q06 a quiet place to study? 0.08

ST21Q07 a desk for study? 0.05

ST21Q08 text books? 0.27

How many of these do you have at your home?ST22Q03 Calculator 0.79 -1.19 0.32 0.88

Figure 47: Item Parameter Estimates for the Index of Home Educational Resources (HEDRES)

Item dimensionality and reliabilityof ‘family possessions’ scales

Structural Equation Modellinglargely confirmed the dimensionalstructure of the items on familypossessions. For the internationalsample both NNFI and CFI wererather low (< 0.85) indicating aconsiderable lack of model fit, butAGFI (0.93) and RMSEA (0.063)showed a still acceptable fit.Estimated correlations between thelatent factors were 0.40 forWEALTH and HEDRES, 0.21 forWEALTH and CULTPOSS, and0.52 for CULTPOSS and HEDRES.The RMSEA measures for thecountry sub-samples ranged from0.050 to 0.079.

Table 75 shows the reliability foreach of the four family possessionsvariables for each participant.Whereas the reliability for WEALTHand CULTPOSS are reasonable, theinternal consistency for HEDRES israther low. Though the reliability ofthese indices might be consideredlower than desirable, they wereretained since they can be interpretedas simple counts of the number ofpossessions. The lower reliabilitysimply suggests that these counts canbe made up of very different subsetsof items.

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

225

In your home, do you have: Parameter EstimatesDelta

ST21Q09 classical literature (e.g., <Shakespeare>)? 0.14

ST21Q10 books of poetry? 0.04

ST21Q11 works of art (e.g., paintings)? -0.18

Figure 48: Item Parameter Estimates for the Index of Possessions Related to ‘Classical Culture’ in theFamily Home (CULTPOSS)

Table 75: Reliability of Family Possessions Scales

Country WEALTH HEDRES CULTPOSS

Australia 0.66 0.43 0.64Austria 0.62 0.28 0.57Belgium 0.61 0.43 0.58Canada 0.69 0.43 0.64Czech Republic 0.67 0.38 0.54Denmark 0.66 0.31 0.60Finland 0.62 0.33 0.65France 0.61 0.30 0.62Germany 0.65 0.31 0.59Greece 0.70 0.30 0.48Hungary 0.70 0.27 0.55Iceland 0.64 0.43 0.57Ireland 0.66 0.35 0.58Italy 0.67 0.22 0.56Japan 0.50 0.20 0.60Korea 0.61 0.17 0.54Luxembourg 0.67 0.52 .067Mexico 0.80 0.48 0.65New Zealand 0.68 0.47 0.65Norway 0.60 0.45 0.69Poland 0.76 0.38 0.53Portugal 0.75 0.29 0.61Spain 0.67 0.26 0.60Sweden 0.68 0.38 0.65Switzerland 0.63 0.31 0.61United Kingdom 0.65 0.46 0.66United States 0.72 0.53 0.64Mean OECD countries 0.70 0.36 0.59Brazil 0.78 0.45 0.50Latvia 0.68 0.32 0.57Liechtenstein 0.60 0.41 0.56Russian Federation 0.63 0.30 0.47Netherlands 0.53 0.29 0.56

Note: ‘Mean OECD countries’ does not include the Netherlands,which did not meet PISA’s sampling standards.

Instruction and LearningInstruction time

Three indices give the time in terms of numberof minutes spent each week at school in thethree assessed domains. The variables werelabelled RMINS for reading courses, MMINS formathematics courses and SMINS for sciencecourses. They are simply the product of thecorresponding item of student question 27(ST27Q01, ST27Q03, ST27Q05) (number ofclass periods in <test language>, mathematicsand science per week) and school question 6(SC06Q03) (number of minutes per single classperiod).

Time spent on homeworkThe PISA index of time spent on homework wasderived from students’ reports on the amount oftime they devoted to homework in the <testlanguage>, in mathematics and in science.Students responded to this on a four-point scale:no time, less than 1 hour a week, between 1 and3 hours a week, and 3 hours or more a week.

Scale scores are standardised Warm estimateswhere positive values indicate more and negativevalues indicate less time spent on homework.The item parameters used for the weightedlikelihood estimation are given in Figure 49.

Teacher supportThe PISA index of teacher support was derivedfrom students’ reports on the frequency withwhich the teacher: shows an interest in everystudent’s learning; gives students an opportunityto express opinions; helps students with theirwork; continues teaching until the studentsunderstand; does a lot to help students; andhelps students with their learning. A four-pointscale was used with response categories: never,some lessons, most lessons and every lesson.

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels of teachersupport. The item parameters used for theweighted likelihood estimation are given inFigure 50.

Achievement pressThe PISA index of achievement press was derivedfrom students’ reports on the frequency withwhich, in their <test language> lesson, the teacher:wants students to work hard; tells students thatthey can do better; and does not like it whenstudents deliver <careless> work; and studentshave to learn a lot. A four-point scale was usedwith response categories: never, some lessons,most lessons and every lesson.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

226

On average, how much time do you spend each Parameter Estimatesweek on homework and study in these subject areas? Delta Tau(1) Tau(2) Tau(3)

ST33Q01 <test language> 0.05 -2.59 -0.04 2.62

ST33Q02 <mathematics> -0.23 -2.33 -0.01 2.34

ST33Q03 <science> 0.17 -2.04 0.02 2.02

Figure 49: Item Parameter Estimates for the Index of Time Spent on Homework (HMWKTIME)

How often do these things happen in your Parameter Estimates<test language> lessons? The teacher: Delta Tau(1) Tau(2) Tau(3)

ST26Q05 shows an interest in every student’s learning. 0.22 -1.78 0.36 1.42

ST26Q06 gives students an opportunity to express opinions. -0.35 -1.80 0.35 1.45

ST26Q07 helps students with their work. -0.04 -1.95 0.34 1.61

ST26Q08 continues teaching until the students understand. -0.02 -1.97 0.31 1.66

ST26Q09 does a lot to help students. -0.02 -2.15 0.30 1.86

ST26Q10 helps students with their learning. 0.20 -1.86 0.23 1.63

Figure 50: Item Parameter Estimates for the Index of Teacher Support (TEACHSUP)

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels ofachievement press. The item parameters used forthe weighted likelihood estimation are given inFigure 51.

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

227

Item dimensionality and reliability of‘instruction and learning’ scales

Structural Equation Modelling confirmed theexpected dimensional structure of the items usedfor the instruction and learning indices. All fitmeasures indicated a close model fit for theinternational sample (RMSEA = 0.046, AGFI =0.97, NNFI = 0.96 and CFI = 0.97), and theestimated correlations between latent factorswere 0.27 for TEACHSUP and ACHPRESS,0.18 for TEACHSUP and HMWKTIME, and

0.15 for ACHPRESS and HMWKTIME. TheRMSEA for the country sub-samples rangedbetween 0.043 and 0.080, showing that thedimensional structure was similar acrosscountries.

Table 76 shows the reliability for each of thefour instruction and learning variables for eachparticipant. Internal consistency is high for bothHMWKTIME and TEACHSUP but onlymoderate for ACHPRESS.

How often do these things happen in your Parameter Estimates<test language> lessons? Delta Tau(1) Tau(2) Tau(3)

ST26Q02 The teacher wants students to work hard. -0.37 -1.33 0.36 0.97

ST26Q03 The teacher tells students that they can do better. 0.22 -1.63 0.65 0.98

ST26Q04 The teacher does not like it when students deliver 0.19 -1.36 0.62 0.74<careless> work.

ST26Q15 Students have to learn a lot. -0.04 -1.76 0.42 1.34

Figure 51: Item Parameter Estimates for the Index of Achievement Press (ACHPRESS)

Table 76: Reliability of Instruction and Learning Scales

Country HMWKTIME TEACHSUP ACHPRESS

Australia 0.81 0.90 0.53Austria 0.57 0.87 0.59Belgium 0.77 0.85 0.57Canada 0.78 0.90 0.55Czech Republic 0.72 0.80 0.67Denmark 0.71 0.86 0.37Finland 0.79 0.88 0.62France 0.74 0.87 0.62Germany 0.66 0.86 0.57Greece 0.84 0.86 0.50Hungary 0.69 0.85 0.54Iceland 0.72 0.87 0.62Ireland 0.78 0.89 0.50Italy 0.70 0.85 0.51Japan 0.86 0.89 0.47

PIS

A 2

00

0 t

ech

nic

al r

epor

t

228

School and Classroom Climate

Disciplinary climateThe PISA index of disciplinary climatesummarises students’ reports on the frequencywith which, in their <test language> lesson: theteacher has to wait a long time for students to<quieten down>; students cannot work well;students don’t listen to what the teacher says;students don’t start working for a long time afterthe lesson begins; there is noise and disorder;

and, at the start of class, more than five minutesare spent doing nothing. A four-point scale wasused with response categories never, somelessons, most lessons and every lesson.

Scale scores are standardised Warm estimateswhere positive values indicate more positiveperception of the disciplinary climate andnegative values indicate less positive perceptionof the disciplinary climate. The item parametersused for the weighted likelihood estimation aregiven in Figure 52.

How often do these things happen in your Parameter Estimates<test language> lessons? Delta Tau(1) Tau(2) Tau(3)

ST26Q01 The teacher has to wait a long time for students to -0.26 -2.48 0.95 1.53<quieten down>.

ST26Q12 Students cannot work well. 0.47 -2.52 1.02 1.50

ST26Q13 Students don’t listen to what the teacher says. 0.11 -2.79 1.08 1.71

ST26Q14 Students don’t start working for a long time after 0.21 -2.23 0.76 1.47the lesson begins.

ST26Q16 There is noise and disorder. -0.10 -2.09 0.93 1.16

ST26Q17 At the start of class, more than five minutes are spent -0.43 -1.56 0.72 0.84doing nothing.

Figure 52: Item Parameter Estimates for the Index of Disciplinary Climate (DISCLIM)

Table 76 (cont.)Country HMWKTIME TEACHSUP ACHPRESS

Korea 0.88 0.83 0.38Luxembourg 0.67 0.88 0.62Mexico 0.74 0.83 0.49New Zealand 0.80 0.89 0.51Norway 0.82 0.88 0.60Poland 0.79 0.86 0.61Portugal 0.76 0.88 0.52Spain 0.79 0.90 0.53Sweden 0.74 0.88 0.55Switzerland 0.68 0.86 0.55United Kingdom 0.79 0.89 0.44United States 0.77 0.91 0.54Mean OECD countries 0.76 0.87 0.54Brazil 0.80 0.86 0.47Latvia 0.73 0.83 0.64Liechtenstein 0.76 0.84 0.55Russian Federation 0.80 0.84 0.53

Netherlands 0.69 0.82 0.49

Note: ‘Mean OECD countries’ does not include the Netherlands, which did notmeet PISA’s sampling standards.

Teacher-student relationsThe PISA index of teacher-student relations wasderived from students’ reports on their level ofagreement with the following statements:students get along well with most teachers; mostteachers are interested in students’ well-being;most of my teachers really listen to what I haveto say; if I need extra help, I will receive it frommy teachers; and most of my teachers treat mefairly. A four-point scale was used with responsecategories strongly disagree, disagree, agree andstrongly agree.

Scale scores are standardised Warm estimateswhere positive values indicate more positiveperceptions of student−teacher relations andnegative values indicate less positive perceptionsof student−teacher relations. The itemparameters used for the weighted likelihoodestimation are given in Figure 53.

Sense of belongingThe PISA index of sense of belonging wasderived from students’ reports on whether theirschool is a place where they: feel like anoutsider, make friends easily, feel like theybelong, feel awkward and out of place, otherstudents seem to like them, or feel lonely. A four-

point scale was used with response categories:strongly disagree, disagree, agree and stronglyagree.

Scale scores are standardised Warm estimateswhere positive values indicate more positiveattitudes towards school and negative valuesindicate less positive attitudes towards school.The item parameters used for the weightedlikelihood estimation are given in Figure 54.

Item dimensionality and reliability of ‘schooland classroom climate’ scales

For the international sample, the ConfirmatoryFactor Analysis showed a good model fit withRMSEA = 0.053, AGFI = 0.95, NNFI = 0.92and CFI = 0.94. The correlation between latentfactors is -0.23 for STUDREL and DISCLIM,-0.09 for BELONG and DISCLIM, and 0.29for BELONG and STUDREL. For the countrysub-samples, the RMSEA ranged between 0.050and 0.073, showing that the dimensionalitystructure is similar across countries.

Table 77 shows the reliability for each of thethree school and classroom climate variables foreach participating country. Internal consistencyis very high for all three indices.

229

How much do you disagree or agree with each of the Parameter Estimatesfollowing statements about teachers at your school? Delta Tau(1) Tau(2) Tau(3)

ST30Q01 Students get along well with most teachers. 0.22 -2.59 -0.76 3.35

ST30Q02 Most teachers are interested in students’ well-being. 0.10 -2.48 -0.70 3.18

ST30Q03 Most of my teachers really listen to what I have to say. 0.14 -2.57 -0.53 3.09

ST30Q04 If I need extra help, I will receive it from my teachers. -0.26 -2.20 -0.81 3.01

ST30Q05 Most of my teachers treat me fairly. -0.20 -2.04 -0.96 3.00

Figure 53: Item Parameter Estimates for the Index of Teacher-Student Relations (STUDREL)

My school is a place where: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3)

ST31Q01 I feel like an outsider (or left out of things). (rev.) -0.57 -0.94 -0.68 1.62

ST31Q02 I make friends easily. 0.03 -1.83 -0.94 2.77

ST31Q03 I feel like I belong. 0.65 -1.49 -0.92 2.41

ST31Q04 I feel awkward and out of place. (rev.) -0.18 -1.34 -0.56 1.89

ST31Q05 other students seem to like me. 0.59 -1.98 -1.27 3.25

ST31Q06 I feel lonely. (rev.) -0.52 -1.18 -0.64 1.81

Note: Items marked ‘rev.’ had their response categories reversed before scaling.

Figure 54: Item Parameter Estimates for the Index of Sense of Belonging (BELONG)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

PIS

A 2

00

0 t

ech

nic

al r

epor

t

230

Reading Habits

Engagement in readingThe PISA index of engagement in reading wasderived from students’ level of agreement withthe following statements: I read only if I have to;

reading is one of my favourite hobbies; I liketalking about books with other people; I find ithard to finish books; I feel happy if I receive abook as a present; for me, reading is a waste oftime; I enjoy going to a bookstore or a library; I

Table 77: Reliability of School and Classroom Climate Scales

Country DISCLIM STUDREL BELONG

Australia 0.84 0.83 0.83Austria 0.81 0.80 0.80Belgium 0.82 0.78 0.73Canada 0.83 0.80 0.84Czech Republic 0.81 0.77 0.69Denmark 0.80 0.82 0.75Finland 0.84 0.78 0.83France 0.82 0.78 0.75Germany 0.80 0.76 0.79Greece 0.71 0.71 0.73Hungary 0.84 0.77 0.76Iceland 0.83 0.81 0.83Ireland 0.86 0.77 0.84Italy 0.82 0.75 0.72Japan 0.80 0.85 0.78Korea 0.81 0.82 0.69Luxembourg 0.77 0.80 0.77Mexico 0.70 0.70 0.71New Zealand 0.85 0.78 0.83Norway 0.83 0.82 0.82Poland 0.81 0.78 0.68Portugal 0.77 0.77 0.69Spain 0.80 0.80 0.71Sweden 0.82 0.81 0.81Switzerland 0.77 0.82 0.75United Kingdom 0.87 0.82 0.83United States 0.83 0.83 0.86Mean OECD countries 0.81 0.79 0.77Brazil 0.76 0.76 0.77Latvia 0.79 0.72 0.71Liechtenstein 0.76 0.82 0.81Russian Federation 0.80 0.70 0.71


Note: ‘Mean OECD countries’ does not include the Netherlands, which didnot meet PISA’s sampling standards.

read only to get information that I need; and Icannot sit still and read for more than a fewminutes. A four-point scale was used withresponse categories strongly disagree, disagree,agree and strongly agree.

Scale scores are standardised Warm estimateswhere positive values indicate more positiveattitudes towards reading and negative valuesindicate less positive attitudes towards reading.The item parameters used for the weightedlikelihood estimation are given in Figure 55.

Reading diversityThe PISA index on reading diversity was derivedfrom dichotomised student reports on how often(never to a few times a year versus about once amonth or more often) they read: magazines, comicbooks, fiction, non-fiction books, email and webpages, and newspapers. It is interpreted as anindicator of the variety of student reading sources.

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values lower levels of reading diversity.

The item parameters used for the weightedlikelihood estimation are given in Figure 56.

Item dimensionality and reliability of‘reading’ scales

A CFA showed a still acceptable fit of the two-dimensional model for the international sample(RMSEA = 0.078, AGFI = 0.91, NNFI = 0.89and CFI = 0.91), with a correlation betweenlatent factors of 0.66. The RMSEA rangedbetween 0.046 and 0.127 across OECDcountries. The lack of fit is largely due to thecommon variance between negatively phrased,and therefore, inverted, items, which is notexplained in the model.

Table 78 shows the reliability for each of thefour reading variables for each participatingcountry. Reliability is generally rather low forDIVREAD but as this is a simple indexmeasuring only the variation in reading material,the low internal consistency was not considereda major concern.

How much do you disagree or agree with these Parameter Estimatesstatements about reading? Delta Tau(1) Tau(2) Tau(3)

ST35Q01 I read only if I have to. (rev.) -0.42 -1.24 -0.18 1.42ST35Q02 Reading is one of my favourite hobbies. 0.83 -1.61 -0.01 1.62ST35Q03 I like talking about books with other people. 1.02 -1.82 -0.28 2.09ST35Q04 I find it hard to finish books. (rev.) -0.55 -1.45 -0.23 1.68ST35Q05 I feel happy if I receive a book as a present. 0.54 -1.59 -0.52 2.11ST35Q06 For me, reading is a waste of time. (rev.) -0.91 -0.87 -0.67 1.54ST35Q07 I enjoy going to a bookstore or a library. 0.45 -1.57 -0.36 1.92ST35Q08 I read only to get information that I need. (rev.) -0.05 -1.85 -0.05 1.90ST35Q09 I cannot sit still and read for more than a few minutes. (rev.)-0.91 -0.96 -0.46 1.42Note: Items marked ‘rev.’ had their response categories reversed before scaling.

Figure 55: Item Parameter Estimates for the Index of Engagement in Reading (JOYREAD)

How often do you read these materials because you want to? Parameter EstimatesDelta

ST36Q01 Magazines -1.52ST36Q02 Comic books 0.75ST36Q03 Fiction (novels, narratives, stories) 0.60ST36Q04 Non-fiction books 1.04ST36Q05 Emails and Web pages 0.16ST36Q06 Newspapers -1.03

Note: Items were dichotomised as follows (0 = 1,2 = never or a few times a year); (1 = 3,4,5 = about once a month,several times a month or several times a week).

Figure 56: Item Parameter Estimates for the Index of Reading Diversity (DIVREAD) Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

231

Table 78: Reliability of Reading Habits Scales

Country JOYREAD DIVREAD

Australia 0.76 0.51Austria 0.76 0.34Belgium 0.74 0.55Canada 0.77 0.50Czech Republic 0.72 0.38Denmark 0.75 0.58Finland 0.75 0.43France 0.71 0.54Germany 0.76 0.40Greece 0.66 0.48Hungary 0.70 0.53Iceland 0.72 0.48Ireland 0.76 0.48Italy 0.71 0.48Japan 0.66 0.44Korea 0.68 0.59Luxembourg 0.71 0.53Mexico 0.61 0.61New Zealand 0.73 0.49Norway 0.75 0.44Poland 0.69 0.50Portugal 0.70 0.50Spain 0.70 0.53Sweden 0.75 0.46Switzerland 0.76 0.46United Kingdom 0.75 0.50United States 0.76 0.61Mean OECD countries 0.72 0.50Brazil 0.70 0.59Latvia 0.64 0.44Liechtenstein 0.73 0.50Russian Federation 0.64 0.51Netherlands 0.74 0.55

Note: ‘Mean OECD countries’ does not include theNetherlands, which did not meet PISA’s samplingstandards.

Motivation and Interest

The following indices were derived from theinternational option on Self-Regulated Learning,which about a quarter of the countries chose notto administer.

Instrumental motivationThe PISA index of instrumental motivation wasderived from students’ reports on how often theystudy to: increase their job opportunities; ensurethat their future will be financially secure; andenable them to get a good job. A four-point scalewas used with response categories almost never,sometimes, often and almost always.

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels ofendorsement of instrumental motivation forlearning. The item parameters used for theweighted likelihood estimation are given inFigure 57.

Interest in reading The PISA index of interest in reading wasderived from student agreement with the threestatements listed in Figure 58. A four-point scalewas used with response categories disagree,disagree somewhat, agree somewhat and agree.For information on the conceptual underpinningof the index see Baumert, Gruehn, Heyn, Köllerand Schnabel (1997).

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels of interestin reading. The item parameters used for theweighted likelihood estimation are given inFigure 58.

232

How often do these things apply to you? Parameter EstimatesDelta Tau(1) Tau(2) Tau(3)

CC01Q06 I study to increase my job opportunities. 0.04 -2.20 -0.01 2.20

CC01Q14 I study to ensure that my future will be financially secure. 0.26 -2.07 0.02 2.05

CC01Q22 I study to get a good job. -0.30 -2.12 0.01 2.11

Figure 57: Item Parameter Estimates for the Index of Instrumental Motivation (INSMOT)

PIS

A 2

00

0 t

ech

nic

al r

epor

t

Interest in mathematics The PISA index of interest in mathematics wasderived from students’ responses to the threeitems listed in Figure 59. A four-point scale withthe response categories disagree, disagreesomewhat, agree somewhat and agree was used.For information on the conceptual underpinningof the index see Baumert, Gruehn, Heyn, Köllerand Schnabel (1997).

Scale scores are standardised Warm estimateswhere positive values indicate a greater interestin mathematics and negative values a lowerinterest in mathematics. The item parametersused for the weighted likelihood estimation aregiven in Figure 59.

Item dimensionality and reliability of‘motivation and interest’ scales

In a CFA, the fit for the three-dimensional modelwas acceptable (RMSEA = 0.040, AGFI = 0.98,NNFI = 0.98 and CFI = 0.99), and thecorrelation between the three latent dimensionswas 0.13 for INTREA and INSMOT, 0.27 forINTMAT and INSMOT, and 0.11 for INTMATand INTREA. Across countries, the RMSEAranged between 0.014 and 0.074, whichconfirmed the dimensional structure for allcountries.

Table 79 shows the reliability for each of thethree motivation and interest variables for eachparticipating country. For all three scales, theinternal consistency is satisfactory across countries.

233

How much do you disagree or agree with each of Parameter Estimatesthe following? Delta Tau(1) Tau(2) Tau(3)

CC02Q06 Because reading is fun, I wouldn’t want to give it up. 0.18 -1.55 -0.06 1.61CC02Q13 I read in my spare time. 0.17 -1.45 -0.19 1.64CC02Q17 When I read, I sometimes get totally absorbed. -0.35 -1.40 -0.15 1.55

Figure 58: Item Parameter Estimates for the Index of Interest in Reading (INTREA)


CC02Q01 When I do mathematics, I sometimes get totally absorbed. -0.02 -1.55 -0.27 1.83

CC02Q10 Because doing mathematics is fun, I wouldn’t want to 0.27 -1.30 -0.13 1.42give it up.

CC02Q21 Mathematics is important to me personally. -0.25 -1.20 -0.28 1.48

Figure 59: Item Parameter Estimates for the Index of Interest in Mathematics (INTMAT)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

Table 79: Reliability of Motivation and Interest Scales

COUNTRY INSMOT INTREA INTMAT

Australia 0.83 0.83 0.72Austria 0.81 0.84 0.72Belgium (Fl.) 0.84 0.87 0.78Czech Republic 0.81 0.86 0.68Denmark 0.79 0.85 0.83Finland 0.82 0.88 0.82Germany 0.83 0.85 0.76Hungary 0.85 0.87 0.75Iceland 0.81 0.78 0.66Ireland 0.85 0.84 0.71Italy 0.84 0.82 0.72Korea 0.72 0.82 0.83Luxembourg 0.82 0.81 0.72Mexico 0.69 0.61 0.51New Zealand 0.85 0.83 0.75Norway 0.83 0.85 0.79Portugal 0.84 0.81 0.73Sweden 0.85 0.75 0.77Switzerland 0.78 0.84 0.77United States 0.83 0.82 0.74Mean OECD countries 0.82 0.83 0.75Brazil 0.83 0.73 0.65Latvia 0.69 0.75 0.72Liechtenstein 0.83 0.84 0.73Russian Federation 0.76 0.78 0.74Netherlands 0.82 0.84 0.76Scotland 0.79 0.86 0.72

Note: ‘Mean OECD countries’ does not include the Netherlands, which didnot meet PISA’s sampling standards. The Flemish Community ofBelgium constitutes half of Belgium’s population and has been given thestatus of a country for these analyses.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

234

Learning Strategies

Control strategiesThe PISA index of control strategies was derivedfrom students’ responses to the five items listedin Figure 60. A four-point scale was used withthe response categories almost never, sometimes,often and almost always. For information on theconceptual underpinning of the index seeBaumert, Heyn and Köller (1994).

Scale scores are standardised Warm estimateswhere positive values indicate higher frequency

and negative values indicate lower frequency ofself-reported use of control strategies. The itemparameters used for the weighted likelihoodestimation are given in Figure 60.

MemorisationThe PISA index of memorisation strategies wasderived from students’ reports on the four items inFigure 61. A four-point scale was used with theresponse categories almost never, sometimes, oftenand almost always. For information on theconceptual underpinning of the index see Baumert,Heyn and Köller (1994) and Pintrich et al. (1993).

Scale scores are standardised Warm estimateswhere positive values indicate higher frequencyand negative values indicate lower frequency ofself-reported use of memorisation as a learningstrategy. Item parameters used for the weightedlikelihood estimation are given in Figure 61.

ElaborationThe PISA index of elaboration strategies wasderived from students’ reports on the four itemsin Figure 62. A four-point scale with theresponse categories almost never, sometimes,often and almost always was used. Forinformation on the conceptual underpinning ofthe index, see Baumert, Heyn and Köller (1994).

Scale scores are standardised Warm estimateswhere positive values indicate higher frequencyand negative values indicate lower frequency ofself-reported use of elaboration as a learningstrategy. The item parameters used for the weightedlikelihood estimation are given in Figure 62.

How often do these things apply to you? Parameter EstimatesWhen I study, Delta Tau(1) Tau(2) Tau(3)

CC01Q03 I start by figuring out exactly what I need to learn. -0.13 -1.73 0.1 1.63CC01Q13 I force myself to check to see if I remember what I 0.23 -1.75 0.23 1.52

have learned.CC01Q19 I try to figure out which concepts I still haven’t 0.02 -2.43 0.22 2.21

really understood.CC01Q23 I make sure that I remember the most important things. -0.54 -2.09 0.07 2.01CC01Q27 and I don’t understand something I look for additional 0.43 -1.99 0.24 1.74

information to clarify this.

Figure 60: Item Parameter Estimates for the Index of Control Strategies (CSTRAT)


CC01Q01 I try to memorise everything that might be covered. -0.26 -1.81 0.30 1.51CC01Q05 I memorise as much as possible. -0.10 -1.63 0.17 1.46CC01Q10 I memorise all new material so that I can recite it. 0.62 -1.70 0.26 1.44CC01Q15 I practise by saying the material to myself over and over. -0.26 -1.57 0.22 1.35

Figure 61: Item Parameter Estimates for the Index of Memorisation (MEMOR)


CC01Q09 I try to relate new material to things I have learned in 0.26 -2.39 0.20 2.20other subjects.

CC01Q17 I figure out how the information might be useful in the 0.32 -2.24 0.23 2.01real world.

CC01Q21 I try to understand the material better by relating it to -0.40 -2.58 0.24 2.35things I already know.

CC01Q25 I figure out how the material fits in with what I have -0.18 -2.77 0.25 2.52already learned.

Figure 62: Item Parameter Estimates for the Index of Elaboration (ELAB)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

235

Effort and perseveranceThe PISA index of effort and perseverance wasderived from students’ reports on the four itemsin Figure 63. A four-point scale with theresponse categories: almost never, sometimes,often and almost always was used.

Scale scores are standardised Warm estimateswhere positive values indicate higher frequencyand negative values indicate lower frequency ofeffort and perseverance as a learning strategy.

The item parameters used for the weightedlikelihood estimation are given in Figure 63.

Item dimensionality and reliability of‘learning strategies’ scales

Table 80 shows the reliability of each of the fourlearning strategies variables for eachparticipating country. For all four scales, theinternal consistency is satisfactory acrosscountries.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

236


CC01Q07 I work as hard as possible. 0.09 -2.58 0.31 2.27

CC01Q12 I keep working even if the material is difficult. 0.33 -2.54 0.22 2.32

CC01Q20 I try to do my best to acquire the knowledge and -0.19 -2.82 0.15 2.68skills taught.

CC01Q28 I put forth my best effort. -0.23 -2.59 0.30 2.29

Figure 63: Item Parameter Estimates for the Index of Effort and Perseverance (EFFPER)

Table 80: Reliability of Learning Strategies Scales

Country CSTRAT MEMOR ELAB EFFPER

Australia 0.81 0.74 0.79 0.81Austria 0.67 0.71 0.73 0.76Belgium (Fl.) 0.69 0.76 0.77 0.78Czech Republic 0.75 0.74 0.79 0.75Denmark 0.70 0.63 0.75 0.76Finland 0.77 0.71 0.79 0.78Germany 0.73 0.73 0.75 0.77Hungary 0.72 0.71 0.78 0.76Iceland 0.72 0.70 0.76 0.77Ireland 0.77 0.75 0.78 0.82Italy 0.74 0.66 0.80 0.78Korea 0.72 0.69 0.76 0.78Luxembourg 0.77 0.77 0.76 0.78Mexico 0.74 0.75 0.77 0.74New Zealand 0.80 0.73 0.76 0.80Norway 0.75 0.77 0.80 0.79Portugal 0.77 0.73 0.74 0.78Sweden 0.76 0.74 0.79 0.80Switzerland 0.74 0.70 0.74 0.79United States 0.83 0.77 0.80 0.83Mean OECD countries 0.76 0.72 0.77 0.78Brazil 0.77 0.67 0.73 0.77Latvia 0.66 0.54 0.63 0.70

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

237

Learning Style

Co-operative learningThe PISA index of co-operative learning wasderived from student reports on the four items inFigure 64. A four-point scale with the responsecategories disagree, disagree somewhat, agreesomewhat and agree was used. For informationon the conceptual underpinning of the index, seeOwens and Barnes (1992).

Scale scores are standardised Warm estimateswhere positive values indicate higher levels ofself-perception of preference for co-operativelearning and negative values lower levels of self-perception of this preference. The itemparameters used for the weighted likelihoodestimation are given in Figure 64.

Competitive learning

The PISA index of competitive learning wasderived from students’ responses to the items inFigure 65, which also gives item parameters usedfor the weighted likelihood estimation. A four-point scale with the response categories disagree,disagree somewhat, agree somewhat and agreewas used. For information on the conceptualunderpinning of the index, see Owens andBarnes (1992).

Scale scores are standardised Warm estimateswhere positive values indicate higher reportedlevels of self-perception of preference forcompetitive learning and negative values indicatelower levels of self-perception of this preference.


CC02Q04 I like to try to be better than other students. 0.19 -1.82 -0.09 1.91

CC02Q11 Trying to be better than others makes me work well. 0.28 -1.82 -0.29 2.10

CC02Q16 I would like to be the best at something. -0.92 -1.10 -0.20 1.30

CC02Q24 I learn faster if I’m trying to do better than the others. 0.45 -1.99 -0.01 2.00

Figure 65: Item Parameter Estimates for the Index of Competitive Learning (COMLRN)


CC02Q02 I like to work with other students. -0.21 -0.85 -0.59 1.44

CC02Q08 I learn most when I work with other students. 0.74 -1.52 -0.11 1.63

CC02Q19 I like to help other people do well in a group. -0.04 -1.29 -0.60 1.89

CC02Q22 It is helpful to put together everyone’s ideas when -0.50 -1.03 -0.59 1.62working on a project.

Figure 64: Item Parameter Estimates for the Index of Co-operative Learning (COPLRN)

Table 80 (cont.)

Country CSTRAT MEMOR ELAB EFFPER

Liechtenstein 0.78 0.72 0.80 0.81Russian Federation 0.71 0.56 0.77 0.76Netherlands 0.70 0.65 0.75 0.77Scotland 0.75 0.68 0.73 0.76

Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. TheFlemish Community of Belgium constitutes half of Belgium’s population and has been given the status of acountry for these analyses.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

238

Item dimensionality and reliability of ‘co-operative and competitive learning’ scales

A CFA showed a poor fit for the initial model,which is largely due to a similar wording ofitems CC02Q02 and CC02Q08. Afterintroducing a correlated error term for theseitems, the model fit was satisfactory for theinternational sample (RMSEA = 0.048, AGFI = 0.98, NNFI = 0.97 and CFI = 0.98).The RMSEA ranged between 0.029 and 0.095across countries, which confirmed the two-dimensional structure for the sub-samples.

Table 81 shows the reliability of the twolearning style variables for each participant. Theinternal consistency for COPLRN is clearlylower than for COMLRN.

Table 81: Reliability of Learning Style Scales

Country COPLRN COMLRN

Australia 0.64 0.77Austria 0.61 0.78Belgium (Fl.) 0.59 0.77Czech Republic 0.64 0.74Denmark 0.60 0.81Finland 0.67 0.80Germany 0.68 0.75Hungary 0.57 0.72Iceland 0.64 0.79Ireland 0.66 0.80Italy 0.67 0.77Korea 0.60 0.76Luxembourg 0.73 0.75Mexico 0.62 0.71New Zealand 0.68 0.79Norway 0.70 0.81Portugal 0.64 0.79Sweden 0.62 0.80Switzerland 0.63 0.77United States 0.77 0.78Mean OECD countries 0.68 0.78Brazil 0.59 0.73Latvia 0.65 0.71Liechtenstein 0.64 0.74Russian Federation 0.66 0.72Netherlands 0.60 0.81Scotland 0.61 0.80

Note: ‘Mean OECD countries’ does not include theNetherlands, which did not meet PISA’s samplingstandards. The Flemish Community of Belgiumconstitutes half of Belgium’s population and hasbeen given the status of a country for theseanalyses.

Self-concept

Self-concept in reading The PISA index of self-concept in reading wasderived from students’ reports on the three itemsin Figure 66. A four-point scale was used withresponse categories disagree, disagree somewhat,agree somewhat and agree. For information onthe conceptual underpinning of the index, seeMarsh, Shavelson and Byrne (1992).

Scale scores are standardised Warm estimateswhere positive values indicate a higher level ofself-concept in reading and negative values alower level of self-concept in reading. The itemparameters used for the weighted likelihoodestimation are given in Figure 66.

Student self-concept in mathematicsThe PISA index of self-concept in mathematicswas derived from students’ responses to the itemsin Figure 67. A four-point scale with the responsecategories disagree, disagree somewhat, agreesomewhat and agree was used. For informationon the conceptual underpinning of the index, seeMarsh, Shavelson and Byrne (1992).

Scale scores are standardised Warm estimateswhere positive values indicate a greater self-concept in mathematics and negative valuesindicate a lower self-concept in mathematics.The item parameters used for the weightedlikelihood estimation are given in Figure 67.

Student academic self-conceptThe PISA index of academic self-concept was

derived from student responses to the items inFigure 68, which gives item parameters used forthe weighted likelihood estimation. A four-pointscale with the response categories disagree,disagree somewhat, agree somewhat and agreewas used. For information on the conceptualunderpinning of the index, see Marsh, Shavelsonand Byrne (1992).

Scale scores are standardised Warm estimateswhere positive values indicate higher levels ofacademic self-concept and negative values, lowerlevels of academic self-concept.

Item dimensionality and reliability of ‘self-concept’ scales

The dimensional structure of the items wasconfirmed by a CFA. The fit was acceptable forthe international sample (RMSEA = 0.073,AGFI = 0.95, NNFI = 0.95 and CFI = 0.97).The correlation between latent factors was 0.09between MATCON and SCVERB, 0.56 betweenMATCON and SCACAD, and 0.61 betweenSCVERB and SCACAD. The RMSEA rangedbetween 0.039 and 0.112 across countries,showing a variation in model fit. There was,however, only one country with a clearly poor fitfor this model.

Table 82 shows the reliabilities of thesecountry indices, which are satisfactory in almostall participating countries the internalconsistency of SCVERB is below 0.60 in onlyone country.

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

239


CC02Q05 I’m hopeless in <test language> classes. (rev.) -0.52 -1.79 0.25 1.54

CC02Q09 I learn things quickly in <test language> class. 0.37 -2.21 -0.28 2.49

CC02Q23 I get good marks in <test language>. 0.15 -2.04 -0.42 2.45Note: Items marked ‘rev.’ had their response categories reversed before scaling.

Figure 66: Item Parameter Estimates for the Index of Self-Concept in Reading (SCVERB)


CC02Q12 I get good marks in mathematics. -0.48 -2.60 -0.36 2.96

CC02Q15 Mathematics is one of my best subjects. 0.36 -2.20 0.01 2.19

CC02Q18 I have always done well in mathematics. 0.12 -2.63 0.00 2.63

Figure 67: Item Parameter Estimates for the Index of Self-Concept in Mathematics (MATCON)


CC02Q03 I learn things quickly in most school subjects. -0.13 -2.79 -0.41 3.20

CC02Q07 I’m good at most school subjects. -0.02 -2.54 -0.39 2.93

CC02Q20 I do well in tests in most school subjects. 0.14 -2.59 -0.34 2.94

Figure 68: Item Parameter Estimates for the Index of Academic Self-Concept (SCACAD)

PIS

A 2

00

0 t

ech

nic

al r

epor

t

240

Learning Confidence

Perceived self-efficacyThe PISA index of perceived self-efficacy wasderived from students’ responses to the three itemsin Figure 69, which gives item parameters used forthe weighted likelihood estimation. A four-pointscale was used with response categories almostnever, sometimes, often and almost always.

Scale scores are standardised Warm estimateswhere positive values indicate a higher sense ofperceived self-efficacy and negative values, alower sense of perceived self-efficacy.

Table 82: Reliability of Self-Concept Scales

Country SCVERB MATCON SCACAD

Australia 0.76 0.85 0.74Austria 0.81 0.88 0.76Belgium (Fl.) 0.72 0.85 0.70Czech Republic 0.74 0.84 0.76Denmark 0.78 0.87 0.80Finland 0.80 0.93 0.84Germany 0.82 0.90 0.78Hungary 0.66 0.87 0.73Iceland 0.77 0.90 0.81Ireland 0.79 0.87 0.77Italy 0.82 0.87 0.74Korea 0.68 0.89 0.78Luxembourg 0.73 0.87 0.74Mexico 0.52 0.84 0.71New Zealand 0.81 0.88 0.79Norway 0.74 0.90 0.85Portugal 0.74 0.87 0.73Sweden 0.75 0.88 0.81Switzerland 0.75 0.88 0.74United States 0.76 0.86 0.79Mean OECD countries 0.75 0.88 0.79Brazil 0.60 0.85 0.73Latvia 0.67 0.85 0.66Liechtenstein 0.73 0.84 0.77Russian Federation 0.67 0.87 0.72Netherlands 0.73 0.89 0.76Scotland 0.81 0.87 0.74

Note: ‘Mean OECD countries’ does not include the Netherlands, which did notmeet PISA’s sampling standards. The Flemish Community of Belgiumconstitutes half of Belgium’s population and has been given the status of acountry for these analyses.

Control expectationThe PISA index of control expectation wasderived from students’ responses to the fouritems in Figure 70, which gives item parametersused for the weighted likelihood estimation. Afour-point scale was used with responsecategories almost never, sometimes, often andalmost always.

Scale scores are standardised Warm estimateswhere positive values indicate higher controlexpectation and negative values indicate lowercontrol expectation.

Item dimensionality and reliability of‘learning confidence’ scales

The CFA for the two-dimensional model showedan acceptable model fit for the internationalsample (RMSEA = 0.069, AGFI = 0.96, NNFI =0.95 and CFI = 0.97). The correlation betweenthe latent factors was as high as 0.93 but forconceptual reasons it was decided to keep twoseparate scales. The RMSEA for the countrysub-samples ranged between 0.046 and 0.111.

Table 83 shows the reliability of each of thetwo learning confidence scales for eachparticipating country. Internal consistency isgood for both indices across countries.

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

241


CC01Q02 I’m certain I can understand the most difficult material 0.42 -2.77 0.45 2.32presented in texts.

CC01Q18 I’m confident I can do an excellent job on assignments -0.26 -2.66 0.29 2.38and tests.

CC01Q26 I’m certain I can master the skills being taught. -0.16 -2.86 0.26 2.6

Figure 69: Item Parameter Estimates for the Index of Perceived Self-Efficacy (SELFEF)


CC01Q04 When I sit myself down to learn something really -0.13 -2.25 0.34 1.90difficult, I can learn it.

CC01Q11 If I decide not to get any bad grades, I can really do it. -0.09 -2.10 0.30 1.81

CC01Q16 If I decide not to get any problems wrong, I can really do it. 0.80 -2.44 0.35 2.09

CC01Q24 If I want to learn something well, I can. -0.58 -2.49 0.29 2.20

Figure 70: Item Parameter Estimates for the Index of Control Expectation (CEXP)

PIS

A 2

00

0 t

ech

nic

al r

epor

t

242

Table 83: Reliability of Learning ConfidenceScales

Country SELFEF CEXP

Australia 0.73 0.80Austria 0.63 0.70Belgium (Fl.) 0.64 0.72Czech Republic 0.66 0.73Denmark 0.72 0.75Finland 0.77 0.81Germany 0.67 0.71Hungary 0.71 0.72Iceland 0.79 0.78Ireland 0.74 0.78Italy 0.67 0.71Korea 0.68 0.75Luxembourg 0.67 0.74Mexico 0.67 0.76New Zealand 0.70 0.79Norway 0.75 0.82Portugal 0.69 0.75Sweden 0.75 0.79Switzerland 0.61 0.72United States 0.77 0.79Mean OECD countries 0.70 0.75Brazil 0.66 0.72Latvia 0.57 0.68Liechtenstein 0.69 0.74Russian Federation 0.67 0.70Netherlands 0.63 0.70Scotland 0.71 0.74

Note: ‘Mean OECD countries’ does not include theNetherlands, which did not meet PISA’s samplingstandards. The Flemish Community of Belgiumconstitutes half of Belgium’s population and hasbeen given the status of a country for theseanalyses.

Computer familiarity or informationtechnology

The following indices were derived from theinternational option on Computer Familiarity,which was administered in almost two-thirds ofthe countries.

Comfort with and perceived ability to usecomputers

The PISA index of comfort with and perceivedability to use computers was derived fromstudents’ responses to the following questions:How comfortable are you using a computer?How comfortable are you using a computer towrite a paper? How comfortable are you takinga test on a computer?, and If you compareyourself with other 15-year-olds, how would yourate your ability to use a computer? For the firstthree questions, a four-point scale was used withthe response categories very comfortable,comfortable, somewhat comfortable and not at allcomfortable. For the last question, a four-pointscale was used with the response categoriesexcellent, good, fair and poor. For informationon the conceptual underpinning of the index, seeEignor, Taylor, Kirsch and Jamieson (1998).

Scale scores are standardised Warm estimateswhere positive values indicate a higher self-per-ception of computer abilities and negative valuesindicate a lower self-perception of computer abilities.The item parameters used for the weightedlikelihood estimation are given in Figure 71.

Computer usageThe PISA index of computer usage was derivedfrom students’ responses to the six questions inFigure 72, which gives the item parameters usedfor the weighted likelihood estimation. A five-point scale with the response categories almostevery day, a few times each week, between once

How comfortable: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3)

IT02Q01 are you with using a computer? -0.64 -1.71 0.01 1.70IT02Q02 are you with using a computer to write a paper? -0.26 -1.38 -0.13 1.51IT02Q03 would you be taking a test on a computer? 0.50 -1.32 -0.11 1.43IT03Q01 If you compare yourself with other 15-year-olds, how 0.40 -2.52 -0.15 2.67

would you rate your ability to use a computer?Note: All items were reversed for scaling.

Figure 71: Item Parameter Estimates for the Index of Comfort With and Perceived Ability to Use Computers (COMAB)

a week and once a month, less than once amonth and never was used. For information onthe conceptual underpinning of the index seeEignor, Taylor, Kirsch and Jamieson (1998).

Scale scores are standardised Warm estimateswhere positive values indicate higher frequencyand negative values indicate lower frequency ofcomputer use.

Interest in computersThe PISA index of interest in computers wasderived from the students’ responses to the fouritems in Figure 73, which gives item parametersused for the weighted likelihood estimation. A two-point scale with the response categoriesyes and no was used.

Scale scores are standardised Warm estimateswhere positive values indicate a more positiveattitude towards computers and negative values,a less positive attitude towards computers.

Item dimensionality and reliability ofIT-related scales

The three-dimensional model was confirmed bya CFA for the international sample (RMSEA =0.070, AGFI = 0.92, NNFI = 0.91 and CFI =0.91). The correlation between latent factors was0.60 for COMUSE and COMAB, 0.59 forCOMATT and COMAB, and 0.50 forCOMATT and COMUSE. The RMSEA for thecountry sub-samples ranged between 0.049 and0.095.

Table 84 shows the reliability for each of thethree IT variables for each participant. Whereasinternal consistency for COMUSE and COMABis very high, COMATT has a somewhat lowerbut still satisfactory reliability in view of the factthat it consists of only four dichotomous items.

243

How often do you use: Parameter EstimatesDelta Tau(1) Tau(2) Tau(3) Tau(4)

IT05Q03 the computer to help you learn school material? -0.17 -0.92 -0.75 0.23 1.44

IT05Q04 the computer for programming? 0.35 -0.27 -0.47 -0.06 0.80

How often do you use each of the following kinds of computer software?IT06Q02 Word processing (e.g., Word® or Word Perfect®) -0.71 -0.70 -1.18 0.24 1.64

IT06Q03 Spreadsheets (e.g., Lotus® or Excel®) 0.23 -1.01 -0.60 0.17 1.44

IT06Q04 Drawing, painting or graphics -0.10 -1.12 -0.38 0.22 1.28

IT06Q05 Educational software 0.40 -1.06 -0.48 0.21 1.33

Note: All items were reversed for scaling.

Figure 72: Item Parameter Estimates for the Index of Computer Usage (COMUSE)

Parameter EstimatesDelta

IT07Q01 It is very important to me to work with a computer. 0.53

IT08Q01 To play or work with a computer is really fun. -1.28

IT09Q01 I use a computer because I am very interested in this. 0.11

IT10Q01 I forget the time, when I am working with the computer. 0.64


Figure 73: Item Parameter Estimates for the Index of Interest in Computers (COMATT)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

PIS

A 2

00

0 t

ech

nic

al r

epor

t

244

School typeA school was classified as either public orprivate according to whether a public agency ora private entity had the ultimate decision-makingpower concerning its affairs. A school wasclassified as public if the principal reported thatit was managed directly or indirectly by a publiceducation authority, government agency, or by agoverning board appointed by government orelected by public franchise. A school wasclassified as private if the principal reported thatit was managed directly or indirectly by a non-government organisation (e.g., a church, a tradeunion, business or another private institution).

A distinction was made between government-dependent and independent private schoolsaccording to the degree of dependence ongovernment funding. School principals wereasked to specify the percentage of the school’stotal funding received in a typical school yearfrom: government sources; student fees or schoolcharges paid by parents; benefactors, donations,bequests, sponsorships or parental fund-raising;and other sources. Schools were classified asgovernment-dependent private if they received50 per cent or more of their core funding fromgovernment agencies and independent private ifthey received less than 50 per cent of their corefunding from government agencies.• SCHLTYPE is based on the question

SC03Q01 and SC04Q01 to SC04Q04. Thisvariable has the following categories:1=Private, government-independent;2=Private, government-dependent;3=Government.

School sizeSCHLSIZE is the sum of questions SC02Q01and SC02Q02, which ask for the total numberof boys and the total number of girls enrolled asof March 31, 2000.

Percentage of girlsPCGIRLS is the ratio of the number of girls tothe total enrolment (SCHLSIZE).

Table 84: Reliability of Computer Familiarity orInformation Technology Scales

Country COMAB COMUSE COMATT

Australia 0.84 0.81 0.69Belgium (Fl.) 0.85 0.80 0.65Canada 0.83 0.84 0.68Czech Republic 0.85 0.79 0.60Denmark 0.81 0.82 0.73Finland 0.85 0.83 0.69Germany 0.86 0.83 0.68Hungary 0.72 0.76 0.69Ireland 0.86 0.82 0.59Luxembourg 0.87 0.85 0.71Mexico 0.80 0.83 0.57New Zealand 0.85 0.83 0.67Norway - 0.80 -Sweden 0.79 0.83 0.68Switzerland 0.86 0.82 0.73United States 0.79 0.80 0.69Mean OECD coun. 0.83 0.82 0.68Brazil 0.87 0.81 0.51Latvia 0.80 0.80 0.65Liechtenstein 0.83 0.83 0.72Russian Federation 0.83 0.83 0.76Scotland 0.86 0.79 0.57

Note: In Norway, only one item was included for the scaleCOMAB; no items were included for COMATT. The Flemish Community of Belgium constitutes halfof Belgium’s population and has been given the statusof a country for these analyses.

Basic School Characteristics

The indices in this section were derived from theSchool Questionnaire.

Hours of schoolingThe PISA index of hours of schooling per year was basedon information provided by principals on: the numberof instructional weeks in the school year; the numberof <class periods> in the school week; and thenumber of teaching minutes in the average single<class period>. The index was derived from theproduct of these three factors, divided by 60, andhas the variable label TOTHRS.

School Policies and Practices

School and teacher autonomySchool principals were asked to report whetherteachers, department heads, the school principal,an appointed or elected board, or higher leveleducation authorities were primarily responsiblefor: appointing and dismissing teachers;establishing teachers’ starting salaries anddetermining their increases; formulating andallocating school budgets; establishing studentdisciplinary and student assessment policies;approving students for admission; choosingtextbooks; determining course content; anddeciding which courses were offered.

The PISA index of school autonomy wasderived from the number of categories thatprincipals classified as not being a schoolresponsibility (which were inverted before scaling).

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels of schoolautonomy. The item parameters used for theweighted likelihood estimation are given inFigure 74.

The PISA index of teacher autonomy wasderived from the number of categories thatprincipals identified as being mainly theresponsibility of teachers.

Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values indicate lower levels of teacherparticipation in school decisions. The itemparameters used for the weighted likelihoodestimation are given in Figure 75.

245

In your school, who has the main responsibility for: Parameter Estimates(not a school responsibility) Delta

SC22Q01 hiring teachers? -0.87SC22Q02 firing teachers? -1.49SC22Q03 establishing teachers’ starting salaries? -3.65SC22Q04 determining teachers’ salary increases? -3.49SC22Q05 formulating the school budget? -0.30SC22Q06 deciding on budget allocations within the school? 2.00SC22Q07 establishing student disciplinary policies? 3.67SC22Q08 establishing student assessment policies? 1.91SC22Q09 approving students for admittance to school? 0.53SC22Q10 choosing which textbooks are used? 2.09SC22Q11 determining course content? -0.32SC22Q12 deciding which courses are offered? -0.09Note: All items were reversed for scaling.

Figure 74: Item Parameter Estimates for the Index of School Autonomy (SCHAUTON)

In your school, who is primarily responsible for: (teachers) Parameter Estimates Delta

SC22Q01 hiring teachers? 1.68SC22Q02 firing teachers? 3.98SC22Q03 establishing teachers’ starting salaries? 4.27SC22Q04 determining teachers’ salary increases? 3.48SC22Q05 formulating the school budget? 1.16SC22Q06 deciding on budget allocations within the school? -0.16SC22Q07 establishing student disciplinary policies? -2.85SC22Q08 establishing student assessment policies? -3.31SC22Q09 approving students for admittance to school? 0.64SC22Q10 choosing which textbooks are used? -4.06SC22Q11 determining course content? -3.12SC22Q12 deciding which courses are offered? -1.73Note: All items were reversed for scaling.

Figure 75: Item Parameter Estimates for the Index of Teacher Autonomy (TCHPARTI) Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

PIS

A 2

00

0 t

ech

nic

al r

epor

t

246

Item dimensionality and reliability of ‘schooland teacher autonomy’ scales

A comparative CFA based on covariance matricescould not be completed for this index because inseveral countries there was no variance for someof the items used. This is plausible given that,depending on the educational system, schoolsand/or teachers never decide certain matters.

Table 85 shows the reliability for each schooland teacher autonomy scale for eachparticipating country. Not surprisingly, thisvaries considerably across countries.

Table 85: Reliability of School Policies andPractices Scales

Country SCHAUTON TCHPARTI

Australia 0.75 0.80Austria 0.58 0.61Belgium 0.61 0.59Canada 0.75 0.71Czech Republic 0.78 0.66Denmark 0.62 0.78Finland 0.67 0.65France 0.78 0.63Germany 0.81 0.75Greece 0.87 0.49Hungary 0.60 0.78Iceland 0.64 0.62Ireland 0.60 0.57Italy 0.57 0.67Japan 0.82 0.82Korea 0.61 0.62Mexico 0.83 0.61New Zealand 0.11 0.74Portugal 0.48 0.50Spain 0.77 0.72Sweden 0.69 0.76Switzerland 0.74 0.68United Kingdom 0.71 0.74United States 0.78 0.82Mean OECD countries 0.77 0.73Brazil 0.82 0.66Latvia 0.64 0.63Liechtenstein 0.88 0.82Russian Federation 0.61 0.38

Netherlands 0.41 0.77

Note: ‘Mean OECD countries’ does not include theNetherlands, which did not meet PISA’s samplingstandards. No results are available for Norwayor Poland, and schools in Luxembourg allprovided the same answers.

School Climate

School principals’ perceptions of teacher-related factors affecting school climate

The PISA index of the principals’ perceptions ofteacher-related factors affecting school climatewas derived from principals’ reports on theextent to which 15-year-olds were hindered intheir learning by: the low expectations ofteachers; poor student-teacher relations; teachersnot meeting individual students’ needs; teacherabsenteeism; staff resisting change; teachers beingtoo strict with students; and students not beingencouraged to achieve their full potential. Thequestions asked are shown in Figure 76. A four-point scale with the response categories not at all,very little, to some extent and a lot was used, andthe responses were inverted before scaling.

Scale scores are standardised Warm estimates,where positive values indicate that teacher-related factors do not hinder the learning of 15-year-olds, whereas negative values indicateschools in which there is a perception thatteacher-related factors hinder the learning of 15-year-olds. The item parameters used for theweighted likelihood estimation are given inFigure 76.

School principals’ perceptions of student-related factors affecting school climate

The PISA index of the principals’ perceptionsof student-related factors affecting school climatewas derived from principals’ reports on the extentto which 15-year-olds were hindered in theirlearning by: student absenteeism; disruption ofclasses by students; students skipping classes;students lacking respect for teachers; the use ofalcohol or illegal drugs; and students intimidatingor bullying other students. A four-point scale withthe response categories not at all, very little, tosome extent and a lot was used, and the responseswere inverted before scaling.

Scale scores are standardised Warm estimates,where positive values indicate that student-relatedfactors do not hinder the learning of 15-year-olds,whereas negative values indicate schools in whichthere is a perception that student-related factorshinder the learning of 15-year-olds. The itemparameters used for the weighted likelihoodestimation are given in Figure 77.

School principals’ perception of teachers’morale and commitment

The PISA index of principals’ perception ofteachers’ morale and commitment was derivedfrom the extent to which school principalsagreed with the statements in Figure 78, whichgives the item parameters used for the weighted

likelihood estimation. A four-point scale with theresponse categories strongly disagree, disagree,agree and strongly agree was used.

Scale scores are standardised Warm estimates,where positive values indicate a higherperception of teacher morale, and negativevalues a lower perception of teacher morale.

In your school, is the learning of <15-year-old students> Parameter Estimateshindered by: Delta Tau(1) Tau(2) Tau(3)

SC19Q01 low expectations of teachers? 0.28 -2.15 -0.36 2.50

SC19Q03 poor student-teacher relations? 0.00 -2.45 0.77 1.68

SC19Q07 teachers not meeting individual students’ needs? -0.57 -2.81 -0.07 2.88

SC19Q08 teacher absenteeism? -0.05 -2.50 0.31 2.18

SC19Q11 staff resisting change? -0.21 -2.40 -0.23 2.63

SC19Q14 teachers being too strict with students? 0.85 -2.63 0.66 1.97

SC19Q16 students not being encouraged to achieve their full -0.30 -2.07 -0.05 2.13

potential?Note: All items were reversed for scaling.

Figure 76: Item Parameter Estimates for the Index of Principals’ Perceptions of Teacher-Related Factors AffectingSchool Climate (TEACBEHA)


SC19Q02 student absenteeism? -1.02 -2.25 -0.07 2.31

SC19Q06 disruption of classes by students? -0.62 -2.86 0.04 2.83

SC19Q09 students skipping classes? -0.29 -2.24 0.1 2.14

SC19Q10 students lacking respect for teachers? 0.09 -2.8 0.19 2.61

SC19Q13 the use of alcohol or illegal drugs? 1.02 -1.68 0.82 0.86

SC19Q15 students intimidating or bullying other students? 0.82 -2.76 0.39 2.37


Figure 77: Item Parameter Estimates for the Index of Principals’ Perceptions of Student-Related Factors AffectingSchool Climate (STUDBEHA)

Think about the teachers in your school. How much do Parameter Estimatesyou agree or disagree with the following statements? Delta Tau(1) Tau(2) Tau(3)

SC20Q01 The morale of teachers in this school is high. 0.55 -2.45 -0.62 3.07

SC20Q02 Teachers work with enthusiasm. 0.07 -2.64 -1.05 3.70

SC20Q03 Teachers take pride in this school. 0.05 -2.32 -1.09 3.41

SC20Q04 Teachers value academic achievement. -0.67 -1.16 -1.70 2.86

Figure 78: Item Parameter Estimates for the Index of Principals’ Perceptions of Teachers’ Morale and Commitment(TCMORALE)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

247

Item dimensionality and reliability of ‘schoolclimate’ scales

The model fit in a CFA was satisfactory for theinternational sample (RMSEA = 0.071, AGFI = 0.91, NNFI = 0.89 and CFI = 0.91).The correlation between latent factors was 0.77for TEACBEHA and STUDBEHA, 0.37 forTCMORALE and TEACBEHA, and 0.23 forTCMORALE and STUDBEHA. For the countrysamples, the RMSEA ranged between 0.043 and0.107.

Table 86 shows that the reliability for each ofthe school climate variables is high acrosscountries.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

248

Table 86: Reliability of School Climate Scales

Country TEACBEHA STUDBEHA TCMORALE

Australia 0.86 0.86 0.74Austria 0.73 0.75 0.69Belgium 0.81 0.88 0.73Canada 0.83 0.81 0.78Czech Republic 0.78 0.76 0.70Denmark 0.70 0.82 0.73Finland 0.76 0.66 0.66France 0.73 0.80 0.79Germany 0.77 0.78 0.79Greece 0.90 0.83 0.84Hungary 0.80 0.83 0.71Iceland 0.78 0.77 0.87Ireland 0.81 0.81 0.81Italy 0.87 0.81 0.80Japan 0.82 0.85 0.89Korea 0.82 0.78 0.87Luxembourg 0.78 0.77 0.67Mexico 0.89 0.88 0.87New Zealand 0.82 0.78 0.84Norway 0.78 0.76 0.76Poland 0.73 0.82 0.59Portugal 0.75 0.81 0.80Spain 0.83 0.83 0.64Sweden 0.76 0.74 0.75Switzerland 0.77 0.83 0.74United Kingdom 0.90 0.86 0.81United States 0.86 0.74 0.88Mean OECD countries 0.83 0.81 0.79Brazil 0.85 0.84 0.83Latvia 0.79 0.73 0.71Liechtenstein 0.78 0.89 0.52Russian Federation 0.79 0.80 0.74


Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meetPISA’s sampling standards.

School Resources

Quality of the schools’ educational resourcesThe PISA index of the quality of a school’seducational resources was based on the schoolprincipals’ reports on the extent to which 15-year-olds were hindered in their learning by alack of resources. The questions asked areshown in Figure 79, which gives item parametersused for the weighted likelihood estimation. Afour-point scale with the response categories notat all, very little, to some extent and a lot wasused and the responses were inverted beforescaling.

Scale scores are standardised Warm estimates,where positive values indicate that the learningof 15-year-olds was not hindered by the(un)availability of educational resources.Negative values indicate the perception of alower quality of educational materials at school.

Quality of the schools’ physical infrastructureThe PISA index of the quality of a school’sphysical infrastructure was derived fromprincipals’ reports on the extent to whichlearning by 15-year-olds in their school washindered by: poor condition of buildings; poorheating, cooling and/or lighting systems; and

lack of instructional space (e.g., classrooms). Thequestions are shown in Figure 80, which givesthe item parameters used for the weightedlikelihood estimation. A four-point scale with theresponse categories not at all, very little, to someextent and a lot was used and responses werereversed before scaling.

Scale scores are standardised Warm estimates,where positive values indicate that the learningof 15-year-olds was not hindered by the school’sphysical infrastructure, and negative valuesindicate the perception that the learning of 15-year-olds was hindered by the school’s physicalinfrastructure.

Teacher shortage The PISA index of teacher shortage was derivedfrom the principals’ view on how much 15-year-old students were hindered in their learning bythe shortage or inadequacy of teachers in the<test language>, mathematics or science. Theitems are given in Figure 81, which shows theitem parameters used for the weighted likelihoodestimation. A four-point scale with the responsecategories not at all, a little, somewhat and a lotwas used, and items were reversed beforescaling.

249

In your school, how much is the learning of Parameter Estimates<15-year-old students> hindered by: Delta Tau(1) Tau(2) Tau(3)

SC11Q04 lack of instructional material (e.g., textbooks)? 0.76 -1.41 0.06 1.35

SC11Q05 not enough computers for instruction? -0.30 -1.50 -0.18 1.67

SC11Q06 lack of instructional materials in the library? 0.10 -1.66 -0.11 1.76

SC11Q07 lack of multi-media resources for instruction? -0.39 -1.79 -0.19 1.99

SC11Q08 inadequate science laboratory equipment? -0.07 -1.51 -0.08 1.59

SC11Q09 inadequate facilities for the fine arts? -0.09 -1.45 -0.07 1.51


Figure 79: Item Parameter Estimates for the Index of the Quality of the Schools’ Educational Resources (SCMATEDU)

In your school, how much is the learning of Parameter Estimates<15-year-old students> hindered by: Delta Tau(1) Tau(2) Tau(3)

SC11Q01 poor condition of buildings? 0.09 -1.27 -0.47 1.75SC11Q02 poor heating, cooling and/or lighting systems? 0.21 -1.36 -0.34 1.70SC11Q03 lack of instructional space (e.g., classrooms)? -0.30 -1.18 -0.34 1.53Note: All items were reversed for scaling.

Figure 80: Item Parameter Estimates for the Index of the Quality of the Schools’ Physical Infrastucture (SCMATBUI)

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

Scale scores are standardised Warm estimates,where positive values indicate that the learningof 15-year-olds was not hindered by teachershortage or inadequacy. Negative values indicatethe perception that 15-year-olds were hinderedin their learning by teacher shortage orinadequacy.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

250

Item dimensionality and reliability of ‘schoolresources’ scales

A CFA showed an acceptable model fit for theinternational sample (RMSEA = 0.077, AGFI = 0.91, NNFI = 0.93 and CFI = 0.94).The correlation between latent factors was 0.67for SCMATEDU and SCMATBUI, 0.32 forTCSHORT and SCMATBUI, and 0.36 forTCSHORT and SCMATEDU. The RMSEA forthe country samples ranged between 0.054 and0.139.

Table 87 shows the reliability for each of theschool resources scales (apart from computers,which are covered in the next section) for eachparticipating country. Internal consistency issatisfactory across countries for all three scales(with the sole exception of SCMATBUI forLiechtenstein, where the number of schools wasvery small).


SC21Q01 a shortage/inadequacy of teachers -0.51 -2.25 0.16 2.08

SC21Q02 a shortage/inadequacy of <test language> teachers? 0.37 -1.64 0.18 1.46

SC21Q03 a shortage/inadequacy of <mathematics> teachers? 0.15 -1.68 0.09 1.60

SC21Q04 a shortage/inadequacy of <science> teachers? -0.01 -1.61 0.02 1.59


Figure 81: Item Parameter Estimates for the Index of Teacher Shortage (TCSHORT)

Table 87: Reliability of School Resources Scales

Country SCMATEDU SCMATBUI TCSHORT

Australia 0.89 0.75 0.82Austria 0.77 0.77 0.67Belgium 0.86 0.73 0.83Canada 0.87 0.80 0.87Czech Republic 0.83 0.65 0.65Denmark 0.79 0.70 0.76Finland 0.84 0.82 0.72France 0.83 0.72 0.88Germany 0.86 0.78 0.84Greece 0.84 0.91 0.95Hungary 0.79 0.59 0.87Iceland 0.80 0.71 0.83Ireland 0.84 0.75 0.78Italy 0.81 0.72 0.89Japan 0.84 0.70 0.93

Availability of computers

School principals provided information on thetotal number of computers available in theirschools and more specifically on the number ofcomputers: available to 15-year-olds; availableonly to teachers; available only to administrativestaff; connected to the Internet; and connected toa local area network. Six indices of computeravailability were prepared and included in thedatabase:• RATCOMP, which is the total number of

computers in the school divided by the schoolenrolment size;

• PERCOMP1, which is the number ofcomputers available for 15-year-olds(SC13Q02) divided by the total number ofcomputers (SC13Q01);

• PERCOMP2, which is the number ofcomputers available for teachers only(SC13Q03) divided by the total number ofcomputers (SC13Q01);

• PERCOMP3, which is the number ofcomputers available only for administrativestaff (SC13Q04) divided by the total numberof computers (SC13Q01);

• PERCOMP4, which is the number ofcomputers connected to the Web (SC13Q05)divided by the total number of computers(SC13Q01); and

• PERCOMP5, which is the number ofcomputers connected to a local area network(SC13Q06) divided by the total number ofcomputers (SC13Q01).

Teacher qualificationsSchool principals indicated the number of fulland part-time teachers employed in their schoolsand the numbers of: <test language> teachers,mathematics teachers and science teachers;teachers fully certified by the <appropriatenational authority>; teachers with a qualificationof <ISCED level 5A> in <pedagogy>, or <ISCEDlevel 5A> in the <test language>, or <ISCEDlevel 5A> in <mathematics>, or <ISCED level5A> in <science>. Five variables wereconstructed:• PROPQUAL, which is the proportion of

teachers with an ISCED 5A qualification inpedagogy (SC14Q02) divided by the totalnumber of teachers (SC14Q01);

Con

stru

ctin

g an

d v

alid

atin

g th

e q

ues

tion

aire

in

dic

es

251

Table 87 (cont.)

Country SCMATEDU SCMATBUI TCSHORT

Korea 0.86 0.75 0.93Luxembourg 0.75 0.63 0.80Mexico 0.88 0.81 0.91New Zealand 0.87 0.82 0.90Norway 0.86 0.77 0.88Poland 0.84 0.79 0.88Portugal 0.85 0.68 0.87Spain 0.85 0.73 0.85Sweden 0.83 0.79 0.87Switzerland 0.79 0.68 0.74United Kingdom 0.91 0.90 0.89United States 0.77 0.73 0.85Mean OECD countries 0.85 0.79 0.88Brazil 0.87 0.83 0.88Latvia 0.79 0.70 0.62Liechtenstein 0.45 0.17 0.59Russian Federation 0.78 0.77 0.85


Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meetPISA’s sampling standards.

• PROPCERT, which is the proportion ofteachers fully certified by the appropriateauthority (SC14Q03) divided by the totalnumber of teachers (SC14Q01);

• PROPREAD, which is the proportion of testlanguage teachers with an ISCED 5Aqualification (SC14Q05) divided by thenumber of test language teachers (SC14Q04);

• PROPMATH, which is the proportion ofmathematics teachers with an ISCED 5Aqualification (SC14Q07) divided by thenumber of mathematics teachers (SC14Q06);and

• PROPSCIE, which is the proportion ofscience teachers who have an ISCED 5Aqualification (SC14Q09) divided by the totalnumber of science teachers (SC14Q08).

Student/teaching staff ratioThe student/teaching staff ratio (STRATIO) wasdefined as the number of full-time equivalentteachers divided by the number of students inthe school. To convert head counts into full-timeequivalents, a full-time teacher employed for atleast 90 per cent of the statutory time as aclassroom teacher received a weight of 1, and apart-time teacher employed for less than 90 percent of the time as a classroom teacher receiveda weight of 0.5.

Class sizeAn estimate of class size was obtained fromstudents’ reports on the number of students intheir respective <test language>, mathematicsand science classes.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

252

International Database

Christian Monseur

Files in the Database

The PISA international database consists offive data files: four student-level files and oneschool-level file. All are provided in text (orASCII format) with the corresponding SAS® andSPSS® control files.

The student files

Student and reading performance data files(filename: intstud_read.txt)

For each student who participated in theassessment, the following information is available:• Identification variables for the country, school

and student;• The student responses on the three

questionnaires, i.e., the student questionnaireand the two international options: ComputerFamiliarity or Information Technologyquestionnaire (IT) and Cross-CurriculumCompetencies questionnaire (CCC);

• The students’ indices derived from the originalquestions in the questionnaires;

• The students’ performance scores in reading;• The student weights and a country adjustment

factor for the reading weights; and• The 80 reading Fay’s replicates for the compu-

tation of the sampling variance estimates.

Student and mathematics performance datafiles (filename: intstud_math.txt)

For each student who was assessed with oneof the booklets containing mathematics material,the following information is available:

• Identification variables for the country, schooland student;

• The students’ responses on the threequestionnaires, i.e., the student questionnaireand the two international options: ComputerFamiliarity or Information Technologyquestionnaire (IT) and Cross-CurriculumCompetencies questionnaire (CCC);


• The students’ performance scores in readingand mathematics;

• The student weights and a country adjustmentfactor for the mathematics weights; and

• The 80 reading Fay’s replicates for the compu-tation of the sampling variance estimates.

Student and science performance data files(filename: intstud_scie.txt)

For each student who was assessed with oneof the booklets that contain science material, thefollowing information is available:• Identification variables for the country, school

and student;• The student responses on the three

questionnaires, i.e., the student questionnaireand the two international options: ComputerFamiliarity or Information Technologyquestionnaire (IT) and Cross-CurriculumCompetencies questionnaire (CCC);


• The students’ performance scores in readingand science;

• The student weights and a country adjustmentfactor for the science weights; and

Chapter 18

• The 80 reading Fay’s replicates for thecomputation of the sampling varianceestimates.

The school file

The school questionnaire data file (filename:intscho.txt)

For each school that participated in theassessment, the following information isavailable:• The identification variables for the country

and school;• The school responses on the school

questionnaire; and• The school indices derived from the original

questions in the school questionnaire.• The school weight.

The assessment items data file

The cognitive file (filename: intcogn.txt)For each item included in the test, this file

shows the students' responses expressed in aone-digit format. The items from mathematicsand science used double-digit coding duringmarking1. A file including these codes wasavailable to national centres.

Records in the database

Records included in the database

Student level• All PISA students who attended one of the two

test (assessment) sessions.• PISA students who only attended the

questionnaire session are included if theyprovided a response to the father’s occupation

questions or the mother’s occupationquestions on the student questionnaire(questions 8 to 11).

School level• All participating schools—that is, any school

where at least 25 per cent of the sampledeligible students were assessed— have arecord in the school-level internationaldatabase, regardless of whether the schoolreturned the school questionnaire.

Records excluded from the database

Student level• Additional data collected by some countries

for a national or international option such asa grade sample.

• Sampled students who were reported as noteligible, students who were no longer atschool, students who were excluded forphysical, mental or linguistic reasons, andstudents who were absent on the testing day.

• Students who refused to participate in theassessment sessions.

• Students from schools where less than 25 per centof the sampled and eligible students participated.

School level• Schools where fewer than 25 per cent of the

sampled eligible students participated in thetesting sessions.

Representing missing data

The coding of the data distinguishes betweenfour different types of missing data:

• Item level non-response: 9 for a one-digit variable,99 for a two-digit variable, 999 for a three-digitvariable, and so on. Missing codes are shownin the codebooks. This missing code is used ifthe student or school principal was expectedto answer a question, but no response wasactually provided.

• Multiple or invalid responses: 8 for a one-digitvariable, 98 for a two-digit variable, 998 for athree-digit variable, and so on. This code isused for multiple choice items in both testbooklets and questionnaires where an invalidresponse was provided. This code is not usedfor open-ended questions.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

254

1 The responses from open-ended items could givevaluable information about students' ideas and thinking,which could be fed back into curriculum planning. Forthis reason, the marking guides for these items in mathe-matics and science were designed to include a two-digitmarking so that the frequency of various types of correctand incorrect response could be recorded. The first digitwas the actual score. The second digit was used tocategorise the different kings of response on the basisof the strategies used by the student to answer the item.The international database includes only the first digit.

• Not applicable: 7 for a one-digit variable, 97for a two-digit variables, 997 for a three-digitvariable, and so on for the studentquestionnaire data file and for the school datafile. Code “n” is used for a one-digit variablein the test booklet data file. This code is usedwhen it was not possible for the student toanswer the question. For instance, this code isused if a question was misprinted or if aquestion was deleted from the questionnaireby a national centre. The “not-applicable”codes and code “n” are also used in the testbooklet file for questions that were notincluded in the test booklet that the studentreceived.

• Not reached items: all consecutive missingvalues starting from the end of each testsession were replaced by the non-reachedcode, “r”, except for the first value of themissing series, which is coded as missing.

How are students andschools identified?

The student identification from the studentfiles consists of three variables, which togetherform a unique identifier for each student:• The country identification variable labelled

COUNTRY. The country codes used in PISAare the ISO 3166 country codes.

• The school identification variable labelledSCHOOLID.

• The student identification variable labelledSTIDSTD.A fourth variable has been included to

differentiate sub-national entities withincountries. This variable (SUBNATIO) is used forfour countries as follows:• Australia. The eight values “01”, “02”, “03”,

“04”, “05”, “06”, “07”, “08” are assigned tothe Australian Capital Territory, New SouthWales, Victoria, Queensland, South Australia,Western Australia, Tasmania and theNorthern Territory respectively.

• Belgium. The value “01” is assigned to theFrench Community and the value “02” isassigned to the Flemish Community.

• Switzerland. The value “01” is assigned to theGerman-speaking community, the value “02”is assigned to the French-speaking communityand the value “03” is assigned to the Italian-speaking community.

• United Kingdom. The value “01” is assignedto Scotland, the value “02” is assigned toEngland and the value “03” is assigned toNorthern Ireland.The school identification consists of two

variables, which together form a uniqueidentifier for each school:• The country identification variable labelled

COUNTRY. The country codes used in PISAare the ISO 3166 country codes.

• The school identification variable labelledSCHOOLID.

The student files

Two types of indices are provided in thestudent questionnaire files. The first set is basedon a transformation of one variable or on acombination of the information included in twoor more variables. Eight indices are included inthe database from this first type. The second setis the result of a Rasch scaling and consists ofweighted likelihood estimate indices. Fifteenindices from the student questionnaire, 14indices from the international option on cross-curriculum competencies questionnaire and 3indices from the international option oncomputer familiarity or information technologyquestionnaire are included in the database fromthis second type. For a full description of theindices and how to interpret them see theManual for the PISA 2000 database (OECD,2002a), also available throughwww.pisa.oecd.org.

Two types of estimates of student achievementare included in the student files for each domain(reading, mathematics and science) and for eachsubscale (retrieving information, interpretingtexts and reflection and evaluation): a weightedlikelihood estimate and a set of five plausiblevalues. It is recommended that the set of plausiblevalues be used when analysing and reportingstatistics at the population level. The use ofweighted likelihood estimates for populationestimates will yield biased estimates. The weightedlikelihood estimates for the domain scales wereeach transformed to a mean of 500 and a standarddeviation of 100 by using the data of the partic-ipating OECD Member countries only2. These

Inte

rnat

ion

al d

atab

ase

255

2 With the exception of the Netherlands, which didnot reach the PISA 2000 sampling standards

linear transformations used weighted data, butthe weights provided in the internationaldatabase were transformed so that the sum ofthe weights per country was a constant. Thesame transformation was applied to the readingsub-scales. The linear transformation applied tothe plausible values is based on the same rules asthe ones used for the weighted estimates, exceptthat the standardisation parameters were derivedfrom the average of the mean and standarddeviation computed from the five plausiblevalues. For this reason, the means and variancesof the individual plausible values are not exactly500 and 100 respectively, but the average of thefive means and the five variances is.

In the international data files, the variablecalled W_FSTUWT is the final student weight.The sum of the weights constitutes an estimateof the size of the target population, i.e., the numberof 15-year-old students in that country attendingschool. In this situation large countries wouldhave a stronger contribution to the results thansmall countries. These weights are appropriate forthe analysis of data that have been collectedfrom all assessed students, such as student question-naire data, and reading performance data.

Because of the PISA test design, using thereading weights for analysing the mathematics orscience data will overweight the studentsassessed with booklet zero (or SE Booklet) andtherefore (typically) underestimate the results. Tocorrect this over-weighting of the students takingthe SE booklet, weight adjustment factors mustbe applied to the weights and replicates. Becauseof the need to use these adjustment factors inanalyses, and to avoid accidental misuse of thestudent data, these data are provided in the threeseparate files: • The file Intstud_read.txt comprises the reading

ability estimates and weights. This file containsall eligible students who participated in thesurvey. As the sample design assessed readingby all students, no adjustment was needed;

• The file Intstud_math.txt comprises thereading and mathematics ability estimates.Weights and replicates in this file have alreadybeen adjusted by the mathematics adjustmentfactor. Thus, no further transformations ofthe weights or replicates are required byanalysts of the data; and

• The file Intstud_scie.txt comprises the readingand science ability estimates. Weights and re-

plicates in this file have already been adjustedby the science adjustment factor. Thus, nofurther transformations of the weights orreplicates are required by analysts of the data.For a full description of the weighting

methodology, including the test design,calculation of the reading weights, adjustmentfactors and how to use the weights, see theManual for the PISA 2000 database (OECD,2002a), also available throughwww.pisa.oecd.org.

The cognitive files

The file with the test data (filename:intcogn.txt) contains individual students’responses to all items used for the internationalitem calibration and in the generation of theplausible values. All item responses included inthis file have a one-digit format, which containsthe score for the student on that item.

The PISA items are organised into units. Eachunit consists of a piece of text or related texts,followed by one or more questions. Each unit isidentified by a short label and by a long label. Theunits’ short labels consist of four characters. Thefirst character is R, M or S respectively for reading,mathematics or science. The three next charactersindicate the unit name. For example, R083 is areading unit called ‘Household’. The full itemlabel (usually seven-digit) represents each particularquestion within a unit. Thus items within a unithave the same initial four characters: all items inthe unit ‘Household’ begin with ‘R083’, plus aquestion number: for example, the third questionin the ‘Household’ unit is R083Q03.

In this file, the items are sorted by domainand alphabetically by short label within domain.This means that the mathematics items appear atthe beginning of the file, followed by the readingitems and then the science items. Withindomains, units with smaller numericidentification appear before those with largeridentification, and within each unit, the firstquestion will precede the second, and so on.

The school file

The school files contain the original variablescollected through the school contextquestionnaire.

Two types of indices are provided in the

PIS

A 2

00

0 t

ech

nic

al r

epor

t

256

school questionnaire files. The first set is basedon a transformation of one variable or on acombination of two or more variables. The data-base includes 16 indices from this first type. Thesecond set is the result of a Rasch scaling andconsists of weighted likelihood estimate indices.Eight indices are included in the database fromthis second type. For a full description of theindices and how to interpret them see the Manualfor the PISA 2000 database (OECD, 2002a),also available through www.pisa.oecd.org.

The school base weight (WNRSCHBW),which has been adjusted for school non-response, is provided at the end of the schoolfile. PISA uses an age sample instead of a gradesample. Additionally, the PISA sample of schoolin some countries included primary schools,lower secondary schools, upper secondaryschools, or even special education schools. Forthese two reasons, it is difficult to conceptuallydefine the school population, except that it is thepopulation of schools with at least one 15-year-old student. While in some countries, thepopulation of schools with 15-year-olds issimilar to the population of secondary schools,in other countries, these two populations ofschools are very different.

A recommendation is to analyse the schooldata at the student level. From a practical pointof view, it means that the school data should tobe imported into the student data file. From atheoretical point of view, while it is possible toestimate the per centage of schools following aspecific school characteristic, it is notmeaningful. Instead, the recommendation is toestimate the per centages of students followingthe same school characteristic. For instance, theper centages of private schools versus publicschools will not be estimated, but the percentages of students attending a private schoolversus the per centage of students attendingpublic schools will.

Further information

A full description of the PISA 2000 databaseand guidelines on how to analyse it inaccordance with the complex methodologiesused to collect and process the data is providedin the Manual for the PISA 2000 database(OECD, 2002a) also available throughwww.pisa.oecd.org

Inte

rnat

ion

al d

atab

ase

257

References

Adams, R.J., Wilson, M.R. and Wang, W. (1997).The multidimensional random coefficientsmultinomial logit model. Applied PsychologicalMeasurement, 21, 1-24.

Adams, R.J., Wilson, M.R. and Wu, M.L. (1997).Multilevel item response modeling: An approachto errors in variables regression. Journal ofEducational and Behavioral Statistics, 22, 47-76.

Beaton, A.E. (1987). Implementing the new design:The NAEP 1983-84 Technical Report. (ReportNo. 15-TR-20). Princeton, NJ: EducationalTesting Service.

Beaton, A.E., Mullis, I.V., Martin, M.O., Gonzalez,E.J., Kelly, D.L. and Smith, T.A. (1996).Mathematics Achievement in the Middle SchoolYears. Chestnut Hill, MA: Center for the Studyof Testing, Evaluation and Educational Policy,Boston College.

Baumert, J., Gruehn, S., Heyn, S., Köller, O. andSchnabel, K.U. (1997). Bildungsverläufe undPsychosoziale Entwicklung im Jugendalter(BIJU): Dokumentation - Band 1. Berlin: Max-Planck-Institut für Bildungsforschung.

Baumert, J., Heyn. S. and Köller, O. (1994). DasKieler Lernstrategien-Inventar. Kiel: Institut fürdie Pädagogik der Naturwissenschaften an derUniversität Kiel.

BMDP. (1992). BMDP Statistical Software. LosAngeles: Author.

Brennan, R.L. (1992). Elements of GeneralizabilityTheory. Iowa City: American College TestingProgram.

Browne, M.W. and Cudeck, R. (1993) AlternativeWays of Assessing Model Fit. In: K. Bollen andS. Long: Testing Structural Equation Models.Newbury Park: Sage.

Cochran, W.G. (1977). Sampling Techniques (3rd

edition). New York: Wiley.

Cronbach, L.J., Gleser, G.C., Nanda, H. andRajaratnam, N. (1972). The Dependability ofBehavioral Measurements: Theory ofGeneralizability for Scores and Profiles. NewYork: Wileyand

Dale, E. and Chall, J.S. (1948). A Formula forPredicting Readability. Columbus: Bureau ofEducational Research, Ohio State University.

Dechant, E. (1991). Understanding and TeachingReading: An Interactive Model. Hillsdale, NJ:Lawrence Erlbaum.

De Landsheere, G. (1973). Le test de closure,mesure de la lisibilité et de la compréhension.Bruxelles: Nathan-Labor.

Eignor, D., Taylor, C., Kirsch, I. and Jamieson, J.(1998). Development of a Scale for Assessing theLevel of Computer Familiarity of TOEFLStudents. (TOEFL Research Report No. 60).Princeton, NJ: Educational Testing Service.

Elley, W.B. (1992). How in the World do StudentsRead. International Association for the Evaluationof Educational Achievement: The Hague

Flesch, R. (1949). The Art of Readable Writing.New York: Harper & Row

Fry, E.B. (1977). Fry’s Readability graph:Clarification, validity and extension to level 17.Journal of Reading, 20, 242-252.

Ganzeboom, H.B.G., de Graaf, P.M. and Treiman,D.J. (1992). A standard international socio-economic index of occupational status. SocialScience Research, 21, 1-56.

Gifi, A. (1990). Nonlinear Multivariate Analysis.New York: Wiley.

Graesser, A.C., Millis, K.K. and Zwaan, R.A.(1997). Discourse comprehension. AnnualReview of Psychology, 48, 163-189.

Greenacre, M.J. (1984). Theory and Applications ofCorrespondence Analysis. London: Academic Press.

Gustafsson, J.E. and Stahl, P.A. (2000). STREAMSUser’s Guide, Version 2.5 for Windows. MölndalSweden: MultivariateWare.

Henry, G. (1975). Comment mesurer la lisibilité.Bruxelles: Nathan-Labor.

Hambleton, R., Merenda, P. and Spielberger, C.D.(in press). Adapting Educational andPsychological Tests for Cross-CulturalAssessment. Hillsdale, NJ: Lawrence Erlbaum.

International Assessment of Educational Progress.(1992). Learning Science. Princeton, NJ:Educational Testing Service.

International Labour Organisation. (1990).International Standard Classification ofOccupations: ISCO-88. Geneva: Author.

Jöreskog, K.G. and Sörbom, D. (1993). LISREL8User’s Reference Guide. Chicago: ScientificSoftware International.

Ap

pen

dic

es

259

Judkins, D.R. (1990). Fay’s Method for VarianceEstimation. Journal of Official Statistics, 6(3),223-239.

Kintsch, W. and van Dijk, T. (1978). Toward amodel of text comprehension and production.Psychological Review, 85, 363-394.

Kirsch, I. and Mosenthal, P. (1990). ExploringDocument Literacy: Variables underlying theperformance of young adults. Reading ResearchQuarterly, 25, 5-30

Kish, L. (1965). Survey Sampling. New York: Wiley.

Longford, N.T. (1995). Models of Uncertainty inEducational Testing. New York: Springer-Verlag.

Macaskill, G., Adams, R.J. and Wu, M.L. (1998).Scaling methodology and procedures for themathematics and science literacy, advancedmathematics and physics scales. In: M. Martinand D.L. Kelly (eds.) Third InternationalMathematics and Science Study, TechnicalReport Volume 3: Implementation and Analysis.Chestnut Hill, MA: Center for the Study ofTesting, Evaluation and Educational Policy,Boston College.

McCormick, T.W. (1988). Theories of Reading inDialogue: An Interdisciplinary Study. New York:University Press of America.

McCullagh, P. and Nelder, J.A. (1983). GeneralizedLinear Models (2nd edition). London: Chapmanand Hall.

Marsh, H.W., Shavelson, R.J. and Byrne, B.M.(1992). A multidimensional, hierarchical self-concept. In: R.P. Lipka and T.M. Brinthaupt(eds.), Studying the Self: Self-perspectives Acrossthe Life-Span. Albany: State University of NewYork Press.

Mislevy, R.J. (1991). Randomization-basedinference about latent variables from complexsamples. Psychometrika, 56, 177–196.

Mislevy, R.J. and Sheehan, K.M. (1987). Marginalestimation procedures. In: A.E. Beaton (ed.), TheNAEP 1983-1984 Technical Report. (ReportNo. 15-TR-20). Princeton, NJ: EducationalTesting Service, 293-360.

Mislevy, R.J. and Sheehan, K.M. (1989).Information matrices in latent-variable models.Journal of Educational Statistics, 14(4), 335-350.

Mislevy, R.J., Beaton, A.E., Kaplan, B. andSheehan, K.M. (1992). Estimating populationcharacteristics from sparse matrix samples of

item responses. Journal of EducationalMeasurement, 29, 133–161.

Nishisato, S. (1980). Analysis of Categorical Data:Dual Scaling and its Applications. Toronto:University of Toronto Press.

Oh, H.L. and Sheuren, F.J. (1983). WeightingAdjustment for Unit Nonresponse. In: W.G.Madow, I. Olkin and D. Rubin (eds.) IncompleteData in Sample Surveys, Vol. 2: Theory andBibliographies. New York: Academic Press, 143-184.

Organisation for Economic Co-Operation andDevelopment (1999a). Measuring StudentKnowledge and Skills: A New Framework forAssessment. Paris: OECD Publications.

Organisation for Economic Co-Operation andDevelopment (1999b). Classifying EducationalProgrammes. Manual for ISCED-97Implementation in OECD Countries. Paris:OECD Publications.

Organisation for Economic Co-Operation andDevelopment (2000). Literacy in theInformation Age. Paris: OECD Publications.

Organisation for Economic Co-Operation andDevelopment (2001). Knowledge and Skills forLife: First Results from PISA 2000. Paris: OECDPublications.

Organisation for Economic Co-Operation andDevelopment (2002a). Manual for the PISA2000 Database. Paris OECD Publications.

Organisation for Economic Co-Operation andDevelopment (2002b). Sample Tasks from thePISA 2000 Assessment. Paris OECDPublications.

Owens, L. and Barnes, J. (1992). LearningPreferences Scales. Hawthorn, Vic.: AustralianCouncil for Educational Research.

Peschar, J.L. (1997). Prepared for Life? How toMeasure Cross-Curricular Competencies. Paris:OECD Publications.

Pintrich, P.R., Smith, D.A.F., Garcia, T. andMcKeachie, W.J. (1993). Reliability andpredictive validity of the Motivated Strategies forLearning questionnaire (MLSQ). Educationaland Psychological Measurement, 53, 801-813.

Robitaille, D.F. and Garden, R.A. (eds.) (1989).The IEA Study of Mathematics II: Contexts andOutcomes of School Mathematics. Oxford:Pergamon Press.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

260

Rubin, D.B. (1987). Multiple Imputations for Non-Response in Surveys. New York: Wiley.

Rumelhart, D.E. (1985). Toward an interactivemodel of reading. In: H. Singer, and R.B.Ruddell (eds.), Theoretical models and theprocesses of reading (3rd edition). Newark, DE:International Reading Association.

Rust, K. (1985). Variance estimation for complexestimators in sample surveys. Journal of OfficialStatistics, 1(4), 381-397.

Rust, K.F. and Rao, J.N.K. (1996). Varianceestimation for complex surveys using replicationtechniques. Statistical Methods in MedicalResearch, 5(3), 283-310.

Särndal, C.E., Swensson, B. and Wretman, J.(1992). Model Assisted Survey Sampling. NewYork: Springer-Verlag.

Searle, S.R. (1971). Linear Models. New York:Wiley.

Shao, J. (1996). Resampling methods in samplesurveys (with discussion). Statistics, 27, 203-254.

Sirotnik, K. (1970). An analysis of varianceframework for matrix sampling. Educationaland Psychological Measurement, 30, 891-908.

Spache, G. (1953). A new readability formula forprimary grade reading materials. ElementarySchool Journal, 55, 410-413.

Thorndike, R.L. (1973). Reading ComprehensionEducation in Fifteen Countries. Uppsala:Almqvist & Wiksell.

Travers, K.J. and Westbury, I. (eds.) (1989). TheIEA Study of Mathematics I: Analysis ofMathematics Curricula. Oxford: PergamonPress.

Van Dijk, T.A. and Kintsch, W. (1983). Strategiesof Discourse Comprehension. Orlando:Academic Press.

Verhelst, N.D. (2000). Estimating VarianceComponents in Unbalanced Designs. (R&DNotities, 2000-1). Arnhem: Cito Group.

Warm, T.A. (1985). Weighted maximum likelihoodestimation of ability in Item Response Theorywith tests of finite length. (Technical ReportCGI-TR-85-08). Oklahoma City: U.S. CoastGuard Institute.

Westat (2000). WesVar 4.0 User’s Guide. Rockville,MD: Author.

Wolter, K.M. (1985). Introduction to VarianceEstimation. New York: Springer-Verlag.

Wu, M.L. (1997). The developement andapplication of a fit test for use with generaliseditem response model. Masters of EducationDissertation. University of Melbourne.

Wu, M.L., Adams, R.J. and Wilson, M.R. (1997).ConQuest: Multi-Aspect Test Software[computer program]. Camberwell, Vic.:Australian Council for Educational Research.

Ap

pen

dic

es

261

PIS

A 2

00

0 t

ech

nic

al r

epor

t

262

Ap

pe

nd

ix 1

: S

um

ma

ry

of

PIS

A R

ea

din

g L

ite

ra

cy

It

em

s

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

Sub-

Num

ber

in w

hich

Sc

ale

Clu

ster

Inte

rnat

iona

l It

em P

aram

eter

sT

hres

hold

sSu

bmit

ted

Per

Cen

tC

orre

ctD

iffi

cult

yT

au-1

Tau

-2T

au-3

12

3

R04

0Q02

Lak

e C

had

AC

ER

Eng

lish

Ret

r. in

f.R

964

.82

(0.3

9)-0

.221

478

R04

0Q03

AL

ake

Cha

dA

CE

RE

nglis

hR

etr.

inf.

R9

50.3

9 (0

.39)

0.45

954

0R

040Q

03B

Lak

e C

had

AC

ER

Eng

lish

Refl

. & ev

al.R

936

.58

(0.3

4)1.

123

600

R04

0Q04

Lak

e C

had

AC

ER

Eng

lish

Int.

tex

tsR

976

.94

(0.2

8)-1

.112

397

R04

0Q06

Lak

e C

had

AC

ER

Eng

lish

Int.

tex

tsR

956

.44

(0.3

7)0.

106

508

R05

5Q01

Dru

gged

Spi

ders

Cit

oE

nglis

hIn

t. t

exts

R5

83.7

9 (0

.22)

-1.3

7737

3R

055Q

02D

rugg

ed S

pide

rsC

ito

Eng

lish

Refl

. & ev

al.R

552

.93

(0.2

7)0.

496

543

R05

5Q03

Dru

gged

Spi

ders

Cit

oE

nglis

hIn

t. t

exts

R5

60.5

7 (0

.29)

0.06

750

4R

055Q

05D

rugg

ed S

pide

rsC

ito

Eng

lish

Int.

tex

tsR

577

.45

(0.2

6)-0

.877

419

R06

1Q01

Mac

ondo

AC

ER

Span

ish

Int.

tex

tsR

665

.98

(0.2

9)-0

.112

488

R06

1Q03

Mac

ondo

AC

ER

Span

ish

Int.

tex

tsR

662

.33

(0.2

7)0.

005

499

R06

1Q04

Mac

ondo

AC

ER

Span

ish

Int.

tex

tsR

668

.36

(0.2

6)-0

.313

470

R06

1Q05

Mac

ondo

AC

ER

Span

ish

Refl

. & ev

al.R

654

.25

(0.3

3)0.

499

544

R06

7Q01

Aes

opG

reec

eG

reek

Int.

tex

tsR

588

.35

(0.1

9)-1

.726

341

R06

7Q04

Aes

opG

reec

eG

reek

Refl

. & ev

al.R

554

.31

(0.2

2)0.

516

-0.4

560.

456

479

611

R06

7Q05

Aes

opG

reec

eG

reek

Refl

. & ev

al.R

562

.48

(0.2

9)0.

182

0.48

2-0

.482

487

543

R07

0Q02

Bea

chC

ito

Eng

lish

Ret

r. in

f.R

152

.74

(0.3

2)0.

485

542

R07

0Q03

Bea

chC

ito

Eng

lish

Ret

r. in

f.R

170

.46

(0.2

5)-0

.487

454

R07

0Q04

Bea

chC

ito

Eng

lish

Refl

. & ev

al.R

117

.52

(0.2

2)2.

581

733

R07

0Q07

TB

each

Cit

oE

nglis

hIn

t. t

exts

R1

54.3

5 (0

.26)

0.42

8-0

.122

0.12

248

858

6R

076Q

03Ir

an A

irC

ito

Eng

lish

Ret

r. in

f.R

568

.58

(0.3

0)-0

.284

473

R07

6Q04

Iran

Air

Cit

oE

nglis

hIn

t. t

exts

R5

53.8

8 (0

.28)

0.47

554

2R

076Q

05Ir

an A

irC

ito

Eng

lish

Ret

r. in

f.R

552

.61

(0.2

8)0.

528

546

R07

7Q02

Flu

AC

ER

Eng

lish

Ret

r. in

f.R

970

.02

(0.3

4)-0

.607

443

R07

7Q03

Flu

AC

ER

Eng

lish

Refl

. & ev

al.R

944

.28

(0.3

4)0.

706

0.79

8-0

.798

542

583

R07

7Q04

Flu

AC

ER

Eng

lish

Int.

tex

tsR

953

.34

(0.3

2)0.

247

521

R07

7Q05

Flu

AC

ER

Eng

lish

Refl

. & ev

al.R

930

.73

(0.2

9)1.

521

637

R07

7Q06

Flu

AC

ER

Eng

lish

Int.

tex

tsR

944

.68

(0.4

0)0.

699

562

R08

1Q01

Gra

ffit

iFi

nlan

dFi

nnis

hIn

t. t

exts

R1

76.3

4 (0

.23)

-0.8

5242

1R

081Q

02G

raff

iti

Finl

and

Finn

ish

Int.

tex

tsR

1R

081Q

05G

raff

iti

Finl

and

Finn

ish

Int.

tex

tsR

152

.65

(0.2

6)0.

475

542

R08

1Q06

AG

raff

iti

Finl

and

Finn

ish

Refl

. & ev

al.R

167

.14

(0.2

6)-0

.303

471

R08

1Q06

BG

raff

iti

Finl

and

Finn

ish

Refl

. & ev

al.R

144

.75

(0.3

1)0.

906

581

R08

3Q01

Hou

seho

ld W

ork

AC

ER

Eng

lish

Int.

tex

tsR

666

.36

(0.2

8)-0

.163

484

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

Sub-

Num

ber

in w

hich

Sc

ale

Clu

ster

Inte

rnat

iona

l It

em P

aram

eter

sT

hres

hold

sSu

bmit

ted

Per

Cen

tC

orre

ctD

iffi

cult

yT

au-1

Tau

-2T

au-3

12

3

R08

3Q02

Hou

seho

ld W

ork

AC

ER

Eng

lish

Ret

r. in

f.R

685

.12

(0.2

1)-1

.422

369

R08

3Q03

Hou

seho

ld W

ork

AC

ER

Eng

lish

Ret

r. in

f.R

680

.53

(0.2

3)-1

.036

404

R08

3Q04

Hou

seho

ld W

ork

AC

ER

Eng

lish

Int.

tex

tsR

668

.06

(0.2

8)-0

.198

480

R08

3Q06

Hou

seho

ld W

ork

AC

ER

Eng

lish

Refl

. & ev

al.R

647

.17

(0.3

2)0.

845

575

R08

6Q04

IfA

CE

RE

nglis

hR

efl. &

eval.

R4

29.0

0 (0

.24)

1.79

166

1R

086Q

05If

AC

ER

Eng

lish

Int.

tex

tsR

483

.48

(0.2

2)-1

.367

374

R08

6Q07

IfA

CE

RE

nglis

hR

efl. &

eval.

R4

72.7

4 (0

.26)

-0.6

8343

6R

088Q

01L

abou

rN

Z’la

ndE

nglis

hIn

t. t

exts

R9

62.7

6 (0

.34)

-0.2

3547

7R

088Q

03L

abou

rN

Z’la

ndE

nglis

hR

etr.

inf.

R9

45.9

7 (0

.27)

0.65

8-0

.578

0.57

848

563

1R

088Q

04T

Lab

our

N Z

’land

Eng

lish

Int.

tex

tsR

939

.20

(0.2

5)1.

117

-1.3

361.

336

473

727

R08

8Q05

TL

abou

rN

Z’la

ndE

nglis

hR

efl. &

eval.

R9

68.8

3 (0

.32)

-0.5

8844

5R

088Q

07L

abou

rN

Z’la

ndE

nglis

hR

efl. &

eval.

R9

61.6

2 (0

.37)

-0.1

3548

6R

091Q

05L

ibra

ryA

CE

RE

nglis

hR

etr.

inf.

R3

85.7

8 (0

.21)

-1.4

8436

3R

091Q

06L

ibra

ryA

CE

RE

nglis

hIn

t. t

exts

R3

86.4

2 (0

.18)

-1.5

9735

3R

091Q

07B

Lib

rary

AC

ER

Eng

lish

Refl

. & ev

al.R

379

.45

(0.2

1)-1

.039

404

R09

3Q03

New

s A

genc

ies

Den

mar

kD

anis

hIn

t. t

exts

R5

74.6

9 (0

.26)

-0.5

6344

7R

093Q

04N

ews

Age

ncie

sD

enm

ark

Dan

ish

Int.

tex

tsR

5R

099Q

02Pl

an I

nter

nati

onal

AC

ER

Eng

lish

Int.

tex

tsR

7R

099Q

03Pl

an I

nter

nati

onal

AC

ER

Eng

lish

Int.

tex

tsR

7R

099Q

04B

Plan

Int

erna

tion

alA

CE

RE

nglis

hR

efl. &

eval.

R7

10.6

6 (0

.16)

2.91

6-0

.316

0.31

670

582

2R

100Q

04Po

lice

Bel

gium

Fren

chR

etr.

inf.

R6

61.1

3 (0

.29)

0.17

851

5R

100Q

05Po

lice

Bel

gium

Fren

chIn

t. t

exts

R6

58.2

6 (0

.28)

0.21

751

8R

100Q

06Po

lice

Bel

gium

Fren

chIn

t. t

exts

R6

80.2

4 (0

.24)

-1.0

1240

6R

100Q

07Po

lice

Bel

gium

Fren

chIn

t. t

exts

R6

80.6

5 (0

.24)

-1.0

6540

2R

101Q

01R

hino

Swed

enSw

edis

hIn

t. t

exts

R1

54.9

2 (0

.30)

0.45

754

0R

101Q

02R

hino

Swed

enSw

edis

hIn

t. t

exts

R1

86.1

1 (0

.17)

-1.6

2735

0R

101Q

03R

hino

Swed

enSw

edis

hR

efl. &

eval.

R1

63.9

3 (0

.28)

-0.0

8249

1R

101Q

04R

hino

Swed

enSw

edis

hIn

t. t

exts

R1

78.3

2 (0

.26)

-1.0

0240

7R

101Q

05R

hino

Swed

enSw

edis

hIn

t. t

exts

R1

47.4

1 (0

.30)

0.80

057

1R

101Q

08R

hino

Swed

enSw

edis

hIn

t. t

exts

R1

55.8

1 (0

.28)

0.38

053

3R

102Q

01Sh

irts

Cit

oE

nglis

hIn

t. t

exts

R4

81.1

0 (0

.24)

-1.3

0638

0R

102Q

04A

Shir

tsC

ito

Eng

lish

Int.

tex

tsR

436

.00

(0.2

9)1.

206

608

R10

2Q05

Shir

tsC

ito

Eng

lish

Int.

tex

tsR

441

.80

(0.3

0)0.

905

581

R10

2Q06

Shir

tsC

ito

Eng

lish

Refl

. & ev

al.R

433

.44

(0.2

5)1.

481

633

R10

2Q07

Shir

tsC

ito

Eng

lish

Int.

tex

tsR

485

.23

(0.2

2)-1

.566

356

R10

4Q01

Tele

phon

eN

Z’la

ndE

nglis

hR

etr.

inf.

R6

82.6

3 (0

.23)

-1.2

3538

6

Ap

pen

dic

es

263

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

S

ub-

Num

ber

in w

hich

Sc

ale

Clu

ster

Inte

rnat

iona

l It

em P

aram

eter

sT

hres

hold

sSu

bmit

ted

Per

Cen

tC

orre

ctD

iffi

cult

yT

au-1

Tau

-2T

au-3

12

3

R10

4Q02

Tele

phon

eN

ew Z

eala

ndE

nglis

hR

etr.

inf.

R6

41.3

0 (0

.29)

1.10

559

9R

104Q

05Te

leph

one

New

Zea

land

Eng

lish

Ret

r. in

f.R

628

.89

(0.1

8)1.

875

-0.9

140.

914

574

764

R10

4Q06

Tele

phon

eN

ew Z

eala

ndE

nglis

hR

etr.

inf.

R6

79.5

1 (0

.22)

-0.9

3141

4R

110Q

01R

unne

rsB

elgi

umFr

ench

Int.

tex

tsR

884

.50

(0.2

4)-1

.561

356

R11

0Q04

Run

ners

Bel

gium

Fren

chR

etr.

inf.

R8

78.8

4 (0

.27)

-1.1

6639

2R

110Q

05R

unne

rsB

elgi

umFr

ench

Ret

r. in

f.R

876

.21

(0.2

4)-1

.025

405

R11

0Q06

Run

ners

Bel

gium

Fren

chR

efl. &

eval.

R8

77.7

8 (0

.23)

-1.0

5940

2R

111Q

01E

xcha

nge

Finl

and

Finn

ish

Int.

tex

tsR

463

.87

(0.2

8)-0

.053

494

R11

1Q02

BE

xcha

nge

Finl

and

Finn

ish

Refl

. & ev

al.R

434

.14

(0.2

4)1.

365

-0.5

540.

554

551

694

R11

1Q04

Exc

hang

eFi

nlan

dFi

nnis

hR

etr.

inf.

R4

77.7

0 (0

.25)

-0.8

6841

9R

111Q

06B

Exc

hang

eFi

nlan

dFi

nnis

hR

efl. &

eval.

R4

44.4

2 (0

.29)

0.80

80.

828

-0.8

2855

259

2R

119Q

01G

ift

USA

Eng

lish

Int.

tex

tsR

373

.46

(0.2

3)-0

.563

447

R11

9Q04

Gif

tU

SAE

nglis

hIn

t. t

exts

R3

40.3

8 (0

.30)

1.14

960

3R

119Q

05G

ift

USA

Eng

lish

Refl

. & ev

al.R

337

.10

(0.2

5)1.

226

0.03

0-0

.030

567

652

R11

9Q06

Gif

tU

SAE

nglis

hR

etr.

inf.

R3

85.2

9 (0

.22)

-1.4

4836

7R

119Q

07G

ift

USA

Eng

lish

Int.

tex

tsR

342

.81

(0.2

6)1.

031

-0.2

160.

216

539

645

R11

9Q08

Gif

tU

SAE

nglis

hIn

t. t

exts

R3

56.6

4 (0

.28)

0.33

852

9R

119Q

09T

Gif

tU

SAE

nglis

hR

efl. &

eval.

R3

63.9

8 (0

.28)

0.11

60.

450

-0.4

5048

053

7R

120Q

01St

uden

t O

pini

ons

AC

ER

Eng

lish

Int.

tex

tsR

757

.89

(0.2

5)0.

202

517

R12

0Q03

Stud

ent

Opi

nion

sA

CE

RE

nglis

hIn

t. t

exts

R7

43.0

4 (0

.25)

1.00

459

0R

120Q

06St

uden

t O

pini

ons

AC

ER

Eng

lish

Refl

. & ev

al.R

761

.33

(0.2

5)0.

105

508

R12

0Q07

TSt

uden

t O

pini

ons

AC

ER

Eng

lish

Refl

. & ev

al.R

771

.82

(0.2

4)-0

.629

441

R12

2Q01

AJu

star

tA

CE

RE

nglis

hR

5R

122Q

01B

Just

art

AC

ER

Eng

lish

R5

R12

2Q01

CJu

star

tA

CE

RE

nglis

hR

5R

122Q

02Ju

star

tA

CE

RE

nglis

hR

efl. &

eval.

R5

59.8

2 (0

.29)

0.13

851

1R

122Q

03T

Just

art

AC

ER

Eng

lish

Ret

r. in

f.R

543

.48

(0.2

5)0.

894

-0.0

290.

029

535

625

R21

6Q01

Am

anda

&

Fran

ceFr

ench

Int.

tex

tsR

973

.54

(0.3

6)-0

.833

423

the

Duc

hess

R21

6Q02

Am

anda

&

Fran

ceFr

ench

Refl

. & ev

al.R

944

.22

(0.4

1)0.

690

561

the

Duc

hess

R21

6Q03

TA

man

da &

Fr

ance

Fren

chIn

t. t

exts

R9

43.9

0 (0

.35)

0.75

156

7th

e D

uche

ssR

216Q

04A

man

da &

Fr

ance

Fren

chR

etr.

inf.

R9

36.6

1 (0

.35)

1.20

660

8th

e D

uche

ssR

R21

6Q06

Am

anda

&

Fran

ceFr

ench

Int.

tex

tsR

967

.10

(0.3

2)-0

.474

455

the

Duc

hess

PIS

A 2

00

0 t

ech

nic

al r

epor

t

264

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

S

ub-

Num

ber

in w

hich

Sc

ale

Clu

ster

Inte

rnat

iona

l It

em P

aram

eter

sT

hres

hold

sSu

bmit

ted

Per

Cen

tC

orre

ctD

iffi

cult

yT

au-1

Tau

-2T

au-3

12

3

R21

9Q01

EE

mpl

oym

ent

IAL

SIA

LS

Ret

r. in

f.R

169

.94

(0.2

7)-0

.550

448

R21

9Q01

TE

mpl

oym

ent

IAL

SIA

LS

Int.

tex

tsR

157

.37

(0.3

1)0.

278

524

R21

9Q02

Em

ploy

men

tIA

LS

IAL

SR

efl. &

eval.

R1

76.2

4 (0

.27)

-0.9

1741

5R

220Q

01So

uth

Pole

Fran

ceFr

ench

Ret

r. in

f.R

746

.03

(0.3

1)0.

785

570

R22

0Q02

BSo

uth

Pole

Fran

ceFr

ench

Int.

tex

tsR

764

.49

(0.3

0)-0

.144

485

R22

0Q04

Sout

h Po

leFr

ance

Fren

chIn

t. t

exts

R7

60.6

7 (0

.28)

0.16

351

3R

220Q

05So

uth

Pole

Fran

ceFr

ench

Int.

tex

tsR

784

.88

(0.2

1)-1

.599

353

R22

0Q06

Sout

h Po

leFr

ance

Fren

chIn

t. t

exts

R7

65.5

4 (0

.27)

-0.1

7248

3R

225Q

01N

ucle

arIA

LS

IAL

SIn

t. t

exts

R2

R22

5Q02

Nuc

lear

IAL

SIA

LS

Int.

tex

tsR

270

.77

(0.2

5)-0

.313

470

R22

5Q03

Nuc

lear

IAL

SIA

LS

Ret

r. in

f.R

289

.87

(0.1

7)-1

.885

327

R22

5Q04

Nuc

lear

IAL

SIA

LS

Ret

r. in

f.R

275

.80

(0.2

8)-0

.663

438

R22

7Q01

Opt

icia

nSw

itze

rlan

dG

erm

anIn

t. t

exts

R4

57.6

5 (0

.34)

0.19

651

6R

227Q

02T

Opt

icia

nSw

itze

rlan

dG

erm

anR

etr.

inf.

R4

59.5

8 (0

.20)

0.04

5-1

.008

1.00

840

160

4R

227Q

03O

ptic

ian

Swit

zerl

and

Ger

man

Refl

. & ev

al.R

455

.58

(0.3

2)0.

295

525

R22

7Q04

Opt

icia

nSw

itze

rlan

dG

erm

anIn

t. t

exts

R4

69.8

5 (0

.28)

-0.2

160.

457

-0.4

5745

050

7R

227Q

06O

ptic

ian

Swit

zerl

and

Ger

man

Ret

r. in

f.R

474

.29

(0.2

6)-0

.916

415

R22

8Q01

Gui

deFi

nlan

dFi

nnis

hIn

t. t

exts

R7

67.2

4 (0

.27)

-0.3

3546

8R

228Q

02G

uide

Finl

and

Finn

ish

Int.

tex

tsR

759

.36

(0.2

3)0.

194

516

R22

8Q04

Gui

deFi

nlan

dFi

nnis

hIn

t. t

exts

R7

57.0

3 (0

.30)

0.24

752

1R

234Q

01pe

rson

nel

IAL

SIA

LS

Ret

r. in

f.R

285

.14

(0.2

0)-1

.487

363

R23

4Q02

pers

onne

lIA

LS

IAL

SR

etr.

inf.

R2

31.7

6 (0

.28)

1.72

865

5R

236Q

01ne

wru

les

IAL

SIA

LS

Int.

tex

tsR

847

.88

(0.3

1)0.

660

558

R23

6Q02

new

rule

sIA

LS

IAL

SIn

t. t

exts

R8

25.5

2 (0

.25)

1.87

366

9R

237Q

01H

irin

g In

terv

iew

IAL

SIA

LS

Ret

r. in

f.R

859

.59

(0.2

6)-0

.005

498

R23

7Q03

Hir

ing

Inte

rvie

wIA

LS

IAL

SIn

t. t

exts

R8

57.6

0 (0

.30)

0.13

251

0R

238Q

01B

icyc

leIA

LS

IAL

SR

etr.

inf.

R2

65.0

7 (0

.30)

-0.0

9249

0R

238Q

02B

icyc

leIA

LS

IAL

SIn

t. t

exts

R2

49.4

6 (0

.27)

0.72

256

4R

239Q

01A

llerg

ies/

Exp

lore

rsIA

LS

IAL

SIn

t. t

exts

R8

52.3

8 (0

.30)

0.46

954

1R

239Q

02A

llerg

ies/

Exp

lore

rsIA

LS

IAL

SR

etr.

inf.

R8

66.7

8 (0

.27)

-0.2

5747

5R

241Q

01W

arra

ntyh

otpo

int

IAL

SIA

LS

Ret

r. in

f.R

2R

241Q

02W

arra

ntyh

otpo

int

IAL

SIA

LS

Int.

tex

tsR

257

.59

(0.3

0)0.

432

538

R24

5Q01

Mov

ie R

evie

ws

IAL

SIA

LS

Ret

r. in

f.R

272

.69

(0.2

5)-0

.556

448

R24

5Q02

Mov

ie R

evie

ws

IAL

SIA

LS

Int.

tex

tsR

266

.63

(0.2

8)-0

.043

494

R24

6Q01

Con

tact

empl

oyer

IAL

SIA

LS

Ret

r. in

f.R

875

.08

(0.2

9)-0

.805

425

R24

6Q02

Con

tact

empl

oyer

IAL

SIA

LS

Ret

r. in

f.R

831

.96

(0.3

1)1.

560

640

Ap

pen

dic

es

265

Ap

pe

nd

ix 2

:S

um

ma

ry

of

PIS

A M

at

he

ma

tic

al

Lit

er

ac

y I

te

ms

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

Num

ber

in w

hich

C

lust

erIn

tern

atio

nal

Item

Par

amet

ers

Thr

esho

lds

Subm

itte

dPe

r C

ent

Cor

rect

Dif

ficu

lty

Tau

-1T

au-2

Tau

-31

23

M03

3Q01

A V

iew

Roo

mC

ito

Dut

chM

374

.03

(0.2

7)-1

.520

442

M03

4Q01

TB

rick

sC

ito

Dut

chM

338

.72

(0.2

8)0.

459

593

M03

7Q01

TFa

rms

Cit

oD

utch

M1

60.9

6 (0

.42)

-0.8

6749

2M

037Q

02T

Farm

sC

ito

Dut

chM

155

.16

(0.3

2)-0

.453

524

M12

4Q01

Wal

king

Cit

oD

utch

M1

34.4

1 (0

.40)

0.53

659

9M

124Q

03T

Wal

king

Cit

oD

utch

M1

18.8

9 (0

.25)

1.27

6-0

.258

0.16

90.

089

600

659

708

M13

6Q01

TA

pple

sC

ito

Dut

chM

249

.14

(0.2

8)-0

.139

548

M13

6Q02

TA

pple

sC

ito

Dut

chM

224

.88

(0.2

6)1.

265

655

M13

6Q03

TA

pple

sC

ito

Dut

chM

213

.19

(0.1

8)1.

820

0.38

5-0

.385

672

723

M14

4Q01

TC

ube

Pain

ting

USA

Eng

lish

M1

63.8

6 (0

.33)

-1.0

1248

1M

144Q

02T

Cub

e Pa

inti

ngU

SAE

nglis

hM

126

.26

(0.3

2)1.

090

641

M14

4Q03

Cub

e Pa

inti

ngU

SAE

nglis

hM

176

.82

(0.3

3)-1

.815

420

M14

4Q04

TC

ube

Pain

ting

USA

Eng

lish

M1

37.3

1 (0

.38)

0.43

159

1M

145Q

01T

Cub

esC

ito

Dut

chM

358

.28

(0.3

2)-0

.559

516

M14

8Q01

TC

onti

nent

Are

aSw

eden

Swed

ish

M2

M14

8Q02

TC

onti

nent

Are

aSw

eden

Swed

ish

M2

19.3

3 (0

.20)

1.47

4-0

.138

0.13

862

971

2M

150Q

01G

row

ing

Up

Cit

oD

utch

M2

61.5

7 (0

.28)

-0.6

8650

6M

150Q

02T

Gro

win

g U

pC

ito

Dut

chM

269

.36

(0.2

6)-1

.129

-0.5

020.

502

415

529

M15

0Q03

TG

row

ing

Up

Cit

oD

utch

M2

45.7

9 (0

.29)

0.00

955

9M

155Q

01Po

pula

tion

Pyr

amid

sC

ito

Dut

chM

360

.46

(0.3

4)-0

.745

501

M15

5Q02

TPo

pula

tion

Pyr

amid

sC

ito

Dut

chM

359

.49

(0.3

2)-0

.643

0.48

1-0

.481

486

532

M15

5Q03

TPo

pula

tion

Pyr

amid

sC

ito

Dut

chM

314

.54

(0.2

2)1.

718

0.22

9-0

.229

660

719

M15

5Q04

TPo

pula

tion

Pyr

amid

sC

ito

Dut

chM

351

.33

(0.2

7)-0

.230

541

M15

9Q01

Rac

ing

Car

Aus

tria

Ger

man

M4

67.2

2 (0

.33)

-0.8

6149

2M

159Q

02R

acin

g C

arA

ustr

iaG

erm

anM

483

.71

(0.2

6)-2

.037

403

M15

9Q03

Rac

ing

Car

Aus

tria

Ger

man

M4

82.8

1 (0

.22)

-1.8

9741

3M

159Q

05R

acin

g C

arA

ustr

iaG

erm

anM

428

.34

(0.3

3)1.

266

655

M16

1Q01

Tri

angl

esU

SAE

nglis

hM

458

.48

(0.3

2)-0

.283

537

M17

9Q01

TR

obbe

ries

TIM

SSE

nglis

hM

426

.31

(0.2

4)1.

328

-0.3

600.

360

609

710

M19

2Q01

TC

onta

iner

sG

erm

any

Ger

man

M3

37.8

1 (0

.28)

0.47

759

5M

266Q

01T

Car

pent

erA

CE

RE

nglis

hM

419

.92

(0.2

5)1.

858

700

M27

3Q01

TPi

pelin

esC

zech

Rep

.C

zech

M4

54.1

7 (0

.31)

-0.1

3054

8

PIS

A 2

00

0 t

ech

nic

al r

epor

t

266

Ap

pe

nd

ix 3

:S

um

ma

ry

of

PIS

A S

cie

nt

ific

Lit

er

ac

y I

te

ms

Iden

tifi

cati

onN

ame

Sour

ceL

angu

age

Num

ber

in w

hich

C

lust

erIn

tern

atio

nal

Item

Par

amet

ers

Thr

esho

lds

Subm

itte

dPe

r C

ent

Cor

rect

Dif

ficu

lty

Tau

-1T

au-2

Tau

-31

23

S114

Q03

TG

reen

hous

eC

ito

Eng

lish

S156

.79

(0.3

5)-0

.373

520

S114

Q04

TG

reen

hous

eC

ito

Eng

lish

S139

.29

(0.3

0)0.

377

-0.0

500.

050

545

645

S114

Q05

TG

reen

hous

eC

ito

Eng

lish

S124

.60

(0.3

2)1.

307

688

S128

Q01

Clo

ning

Cit

oFr

ench

S261

.64

(0.2

8)-0

.557

502

S128

Q02

Clo

ning

Cit

oFr

ench

S245

.38

(0.2

8)0.

284

586

S128

Q03

TC

loni

ngC

ito

Fren

chS2

61.1

6 (0

.27)

-0.5

2750

5S1

29Q

01D

aylig

htA

CE

RE

nglis

hS4

38.7

5 (0

.33)

0.62

061

9S1

29Q

02T

Day

light

AC

ER

Eng

lish

S417

.82

(0.2

5)1.

497

0.67

3-0

.673

682

732

S131

Q02

TG

ood

Vib

rati

ons

AC

ER

Eng

lish

S250

.51

(0.2

8)0.

028

560

S131

Q04

TG

ood

Vib

rati

ons

AC

ER

Eng

lish

S224

.74

(0.2

4)1.

438

701

S133

Q01

Res

earc

hU

SAE

nglis

hS1

56.2

2 (0

.33)

-0.3

5652

2S1

33Q

03R

esea

rch

USA

Eng

lish

S142

.24

(0.3

0)0.

313

589

S133

Q04

TR

esea

rch

USA

Eng

lish

S143

.25

(0.3

4)0.

250

582

S195

Q02

TSe

mm

elw

eis’

Dia

ryC

ito

Dut

chS3

24.9

5 (0

.26)

0.94

61.

273

-1.2

7363

866

6S1

95Q

04Se

mm

elw

eis’

Dia

ryC

ito

Dut

chS3

63.3

3 (0

.27)

-0.6

4349

3S1

95Q

05T

Sem

mel

wei

s’ D

iary

Cit

oD

utch

S367

.58

(0.2

9)-0

.900

467

S195

Q06

Sem

mel

wei

s’ D

iary

Cit

oD

utch

S360

.04

(0.2

7)-0

.494

508

S209

Q01

TT

idal

Pow

erA

CE

RE

nglis

hS3

S209

Q02

TT

idal

Pow

erA

CE

RE

nglis

hS3

43.3

1 (0

.32)

0.34

659

2S2

13Q

01T

Clo

thes

Aus

tral

iaE

nglis

hS1

40.0

1 (0

.33)

0.41

959

9S2

13Q

02C

loth

esA

ustr

alia

Eng

lish

S175

.41

(0.3

1)-1

.484

409

S252

Q01

Sout

h R

aine

aK

orea

Kor

ean

S348

.29

(0.2

7)0.

026

560

S252

Q02

Sout

h R

aine

aK

orea

Kor

ean

S371

.98

(0.2

4)-1

.123

445

S252

Q03

TSo

uth

Rai

nea

Kor

eaK

orea

nS3

54.5

9 (0

.26)

-0.1

7654

0S2

53Q

01T

Ozo

neN

orw

ayN

orw

egia

nS4

28.3

7 (0

.30)

0.97

80.

609

-0.6

0962

868

2S2

53Q

02O

zone

Nor

way

Nor

weg

ian

S434

.78

(0.3

2)0.

849

642

S253

Q05

Ozo

neN

orw

ayN

orw

egia

nS4

53.7

9 (0

.32)

-0.1

0854

7S2

56Q

01Sp

oons

TIM

SSE

nglis

hS1

88.2

7 (0

.20)

-2.4

9130

8S2

68Q

01A

lgae

Swed

enSw

edis

hS4

73.4

1 (0

.29)

-1.2

5043

2S2

68Q

02T

Alg

aeSw

eden

Swed

ish

S439

.83

(0.3

8)0.

578

615

S268

Q06

Alg

aeSw

eden

Swed

ish

S457

.39

(0.4

1)-0

.236

534

S269

Q01

Ear

th’s

Tem

pera

ture

Cit

oD

utch

S259

.28

(0.2

9)-0

.460

511

S269

Q03

TE

arth

’s T

empe

ratu

reC

ito

Dut

chS2

41.3

6 (0

.31)

0.49

760

7S2

69Q

04T

Ear

th’s

Tem

pera

ture

Cit

oD

utch

S235

.30

(0.3

1)0.

712

629

S270

Q03

TO

zone

Bel

gium

Fren

chS4

56.4

2 (0

.34)

-0.2

8752

9

Ap

pen

dic

es

267

As part of the field trial, informal validity studieswere conducted in Canada, the Czech Republic,France and the United Kingdom. These studiesaddressed the question of how well 15-year-oldscould be expected to report the occupations oftheir parents.

Canada

The Canadian study found the followingcorrelation results:• between achievement and mother’s occupation

reported by student: 0.24;• between achievement and mother’s occupation

reported by mother in 80 per cent of casesand father in 20 per cent of cases: 0.20;

• between achievement and father’s occupationreported by student: 0.21; and

• between achievement and father’s occupationreported by father in 20 per cent of cases andmother in 80 per cent of cases: 0.29. (It should be noted that 80 per cent of theresponses here are proxy data.)

Czech Republic

The correlations between student and parent-reported parental occupational status (SEI) in thestudy in the Czech Republic were:• for mothers 0.87; and• for fathers 0.85.

France

The extent of agreement between the students’and the parents’ reports of the parents’occupations in France, grouped by ISCO MajorGroup, is shown in Table 88. This table couldcall the accuracy of student’s reports of theirparents’ occupations into question. However,two important factors should be noted. First, thetable includes non-response, which necessarilydecreases the level of agreement. Non-responseby parents ranged from 2 per cent to over 10 percent for the Manager group. Second, theresponses not in agreement were clusteredaround the main diagonal. So, for example,

PIS

A 2

00

0 t

ech

nic

al r

epor

t

268

Appendix 4: The Validity Studies of Occupation andSocio-economic Status (summary)

Table 88: Validity Study Results for France

Mother FatherISCO Major % Agreement Number % Agreement NumberGroup Group with Parent With Parent

1000 Legislators, Senior Officials, Managers 39 26 44 87

2000 Professionals 74 88 69 87

3000 Technicians, Associate Professionals 62 125 50 126

4000 Clerks 69 137 43 35

5000 Service/Sales 63 110 62 38

6000 Agricultural/Primary Industry 75 8 85 39

7000 Craft and Trades 64 17 76 130

8000 Operators, Drivers 67 22 58 66

9000 Elementary Occupations 57 57 43 37

Total 711 711

of students who responded that their father wasin the Craft and Trades group, 76 per cent of thefathers put their occupation in the same categoryand 11 per cent were in the next category(Operators). Of students who indicated thattheir mothers were professionals, their answerswere confirmed by their mothers in 74 per centof cases and another 19 per cent were self-reported para-professionals. Although studentswho indicated that their father’s occupation wasin the manager group had this confirmed bytheir fathers in only 44 per cent of cases, another30 per cent of the fathers were self-reportedprofessionals and para-professionals. Third, therewere very few irreconcilable dis-agreements. Forexample, in the previous example, no fatherwhose child classified him as a Manager self-reported himself in major groups 6 to 9.Similarly, of fathers reported as professionals bytheir children, very few (0.5 per cent) indicatedthat they worked as operatives or unskilledlabourers.

United Kingdom

In the study carried out in the United Kingdom,SEI scores of parents’ occupations weredeveloped from 3-digit ISCO codes.

The correlations based on the SEI scores ofoccupations reported by students and parentswere: • for mothers 0.76; and• for fathers 0.71.

The extent of agreement between the students’and the parents’ reports of the parents’occupations, grouped by ISCO Major Group, isshown in Table 89 for the United Kingdomstudy.

Correlations based on the SEI scoresdeveloped from the three-digit ISCO Codeswere:• between achievement and mother’s occupation

reported by student: 0.32;• between achievement and mother’s self-

reported occupation: 0.25;• between achievement and father’s occupation

reported by student: 0.36; and• between achievement and father’s self-reported

occupation: 0.38.

Ap

pen

dic

es

269

Table 89: Validity Study Results for the United Kingdom

Mother FatherISCO Major % Agreement Number % Agreement NumberGroup Group with Parent With Parent

1000 Legislators, Senior Officials, Managers 53 34 55 71

2000 Professionals 94 69 82 44

3000 Technicians, Associate Professionals 40 25 60 30

4000 Clerks 64 74 50 12

5000 Service/Sales 73 67 69 13

6000 Agricultural/Primary Industry 0 1 0 1

7000 Craft and Trades 20 5 68 41

8000 Operators, Drivers 83 6 71 21

9000 Elementary Occupations 60 20 56 9

Total 309 246

Not all pieces of information gathered by SchoolQuality Monitors (SQMs) or National CentreQuality Monitors (NCQMs) are used in this report.Items considered relevant to data quality and toconducting the testing sessions are included here inthe section on SQM reports. A more extensiverange of information is included in the section onNCQM reports.

School Quality Monitor Reports

A total of 103 SQMs submitted 893 reports thatwere processed by the Consortium. These camefrom 31 national centres (including the twolinguistic regions of Belgium, which are countedas two national centres in these analyses). Theform of the reports from one country was notamenable to data entry and they were thereforenot included in the quality monitoring analyses.

Administration of Tests and Questionnairesin Schools

SQMs were asked to record the time at keypoints before, during and after theadministration of the tests and questionnaires.The times, measured in minutes, were transformedinto duration to the nearest minute for:• Cognitive Blocks One and Two: Time taken

for the first and second parts of theassessment booklet (referred to in theseanalyses as ‘cognitive blocks’) consideredseparately (should each equal 60 minutes ifPISA procedures were completely followed);

• Questionnaire: Time taken for the StudentQuestionnaire;

• Student working time: Time taken for the firstand second cognitive blocks plus thequestionnaire (that is, total time spent bystudents in working on PISA materials). It wasexpected that this would be around 150minutes, excluding breaks between sessions andtime spent by the Test Administrator givinginstructions for conducting the session; and

• Total duration: Time from the entrance of thefirst student to the exit of the last student,which provides a measure of how long PISAtook out of a student’s school day.

Duration of cognitive blocksThe mean duration of each of the first and secondcognitive blocks was 60 minutes. Totals of 97 and93 per cent of schools, respectively, completed thefirst and second blocks in 60±2 minutes. For eachblock, only 1 per cent of schools took more than62 minutes, three of these using 70 minutes forthe first block and two using more than 70 minutes for the second block (longest time was79 minutes). Just over 2 per cent of schools forthe first block and 6 per cent for the second blocktook less than 58 minutes.

There was a group of 20 schools from onecountry that took only 50 minutes to completethe second block, which points to a systematicerror in administration procedures. Follow-up ofother schools that finished early revealed thatthe students had either lost interest or had givenup because they found the test too difficult.Some commented that the students gave upbecause they were unfamiliar with the format ofthe complex multiple-choice items.

In an attempt to explain the times in excess of62 minutes, cross-tabulations were constructedof the time variable with the ‘general disruption’variable (Q28), and with the ‘detection ofdefective booklets’ variable (Q34). None of theschools where the time exceeded 62 minutes wasreported to have had these problems.

Duration of Student Questionnaire sessionThe mean duration of the Student Questionnairesession was 36 minutes. A total of 39 per cent ofschools took 30 minutes to complete thequestionnaire. By 40 minutes, 77 per cent ofschools visited by SQMs had completed thequestionnaire. The minimum time taken tocomplete the session was 17 minutes (at oneschool), and the maximum was 95 minutes (alsoat one school). Figure 82 shows the statistics andhistogram summarising these data.

No school was reported as having thequestionnaire session and the cognitive blocksessions on different days.

Duration of student working timeThis section is concerned with the time students

spent working on PISA materials, excluding breaksbetween sessions and time spent by the TestP

ISA

20

00

tec

hn

ical

rep

ort

270

Appendix 5: The Main Findings from Quality Monitoringof the PISA 2000 Main Study

Administrator giving instructions for the session.It was expected that this time would be around150 minutes. About 15 per cent of schools took150 minutes, and about 24 per cent took less than150 minutes. Thus, the majority of schools (about60 per cent) took longer than was expected al-though only a third took longer than 155 minutesand the percentage drops rapidly beyond 160 min-utes. Most of the additional time was taken up incompleting the Student Questionnaire. The inclu-sion of various national and international optionswith the questionnaire makes it difficult to establishthe reasons for the long duration of the workingperiod for many schools. Figure 83 shows thestatistics and histogram summarising these data.

Ap

pen

dic

es

271

Count Midpoint One symbol equals approximately 8.00 occurrences

1 16 |20 20 |***50 24 |******85 28 |***********

237 32 |******************************150 36 |*******************109 40 |**************58 44 |*******26 48 |***32 52 |****13 56 |**29 60 |****14 64 |**4 68 |*2 72 |0 76 |0 80 |0 84 |0 88 |1 92 |1 96 |

+++++++++++++++++++0 80 160 240 320 400

Histogram frequency

Mean 36.294 Std err .354 Median 35.000Mode 30.000 Std dev 10.222 Variance 104.482

Valid cases 832 Missing cases 61

Figure 82: Duration of Questionnaire Session, Showing Histogram and Associated Summary Statistics

Total duration of PISA in the schoolThe mean duration from the entrance to the testroom to the exit of the last student was 205 min-utes (about three and one-half hours), with amedian of 200 minutes and mode of 195 minutes.There were noteworthy extremes. At one extreme,the entire PISA exercise took 75 minutes—in schoolsusing the SE booklet, who were only required tocomplete one hour of testing. At the other extreme,some schools took more than 300 minutes (overfive hours). These long periods were due toextended breaks between the two parts of thetest, or between the testing and questionnairesessions (in some cases as long as 90 minutes),or both. The very long breaks were mostly takenin only one country.

Preparation for the Assessment

Conditions for the test and questionnairesessions

Generally, the data from the SQM reportssuggest that the preparations for the PISAsessions went well and test room conditionswere adequate. Most of the 7 per cent of caseswhere SQMs commented that conditions wereless than adequate were because the testing roomwas too small.

Introducing the studyA script for introducing PISA to the students wasprovided in the Test Administrator’s Manual.While a small amount of latitude was permittedto suit local circumstances, e.g., in the way

PIS

A 2

00

0 t

ech

nic

al r

epor

t

272

Count Midpoint One symbol equals approximately 8.00 occurrences

0 101 |1 107 |1 113 |2 119 |0 125 |3 131 |

15 137 |**83 143 |**********

235 149 |*****************************195 155 |************************126 161 |****************59 167 |*******39 173 |*****35 179 |****15 185 |**4 191 |*0 197 |0 203 |0 209 |2 215 |0 221 |

+++++++++++0 80 160 240 320 400

Histogram frequency

Mean 155.598 Std err .389 Median 154.000Mode 150.000 Std dev 11.111 Variance 123.445

Valid cases 815 Missing cases 78

Figure 83: Duration of Total Student Working Time, with Histogram and Associated Summary Statistics

encouragement was given to students concerningthe importance of participating in PISA, TAs werereminded to follow the script as closely as feasible.From the SQM reports, it appears that TAs tookthis advice seriously. Almost 80 per cent of thosevisited by SQMs followed the script exactly, withabout 18 per cent making minor adjustmentsand only 2 per cent reported as paraphrasing thescript in a major way. While the number ofmajor adjustments is small (about 20 cases), it isenough to assume that insufficient emphasis mayhave been given to this aspect in the TA training.

Beginning the Testing Sessions

Of more importance is the accuracy with whichthe TAs read the instructions for the first andsecond parts of the testing session. Question 12in the box below pertained to the first part ofthe session.

Almost all TAs followed the script of theCognitive Blocks section of the manual forgiving the directions, the examples and theanswers to the examples, prior to beginning theactual test. The number of major additions anddeletions would be a concern if they pertained tothese aspects, but, given that they mostly relatedto whether students should stay in the testingroom after finishing and were spread throughabout half the countries, it is unlikely thesedepartures from the script would have had mucheffect on achievement, except possibly in one

country where no script was provided to TAs.Corresponding statistics for the second part of

the testing session showed that a higherpercentage (83) of TAs followed the instructionsverbatim, with smaller numbers of adjustmentsto the script. Numbers of major adjustmentswere about the same as for the first part of thesession. Again, adjustments to the script usuallyinvolved paraphrasing, but this occurred oftenenough that it seems that not enough emphasiswas put during TA training on the need foradhering accurately to the script.

Conducting the Testing Sessions

In assessing how TAs recorded information,monitored the students, and ended the first andsecond cognitive block sessions, the SQMsoverwhelmingly judged that the sessions wereconducted according to PISA procedures.Information on timing and attendance was notrecorded in only five cases. About 5 per cent ofTAs were reported as remaining at their desksduring the session rather than walking around tomonitor students as they worked, but somecommented that the students were so wellbehaved that there was no need to walk around.In all but six instances in the first session and 14in the second, students stopped work promptlywhen the TA said it was time to stop.

At the end of the first and second parts of thetesting session, close to 85 per cent of the TAs

Ap

pen

dic

es

273

Q 12 Did the Test Administrator read the directions, the examples and the answer to the examples fromthe Script of the Cognitive Blocks exactly as it is written?

Yes 705 (79.8%)

No 178 (20.2%) If no, did the Test Administrator make:

Minor additions 136 (15.2%)

Major additions 23 (2.6%)

Minor deletions 81 (9.1%)

Major deletions 11 (1.2%)

Missing 10 (1.1%)

If major additions or deletions were made, please describe:

No explanations were given by the SQMs for the major deletions, but it is likely that they arosewhen Test Administrators paraphrased instructions. Nearly all of the major deletions occurredbecause the Test Administrator said that students could leave when they had finished, whenthe instructions said they should remain in the room. This occurred in 17 different countries.

were reported as having followed the scriptexactly as it was written. Other cases involvedsome paraphrasing of the script (e.g., remindingstudents to complete the Calculator Use questionat the end of their test booklet), and all but avery small number were considered to be minor.Of serious concern, however, is the countrywhere no script was provided to TAs.

The Student Questionnaire Session

Although TAs and their assistants wereencouraged to assist students in answeringquestions in the questionnaire, about the samepercentage as for the cognitive sessions read thescript exactly as it was written and seemed toperceive the session as rule-bound. In about 100cases the TAs paraphrased the script, but in afurther 20 cases the TAs substituted their owninstructions entirely. In some of these instances(both types), the important instructions thatstudents could ask for help if they haddifficulties and that their answers would not beread by school staff were missed. Theseomissions represent a serious deviation fromPISA procedures and are thus a concern inrelation to the accuracy of students’ responses.

Most TAs realised that time constraints for

completing the questionnaire did not need to berigorously imposed. Because of this flexibility, thetiming procedures were less closely monitoredthan they were for the cognitive blocks.

General Questions Concerning theAssessment

This section of the appendix describes generalissues concerning how PISA sessions wereconducted, such as:• frequency of interruptions;• orderliness of the students;• evidence of cheating;• how appropriately the TA answered

questions;• whether any defective books were detected

and whether their replacement was handledappropriately;

• whether any late students were admitted; and• whether students refused to participate.

The SQMs’ responses to some of these areshown in the boxes below and some commentsare made following the boxes.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

274

Q 28 Were there any general disruptions to the session that lasted for more than one minute(e.g., alarms, announcements, changing of classes, etc.)?

Yes 99 (11.1%) If yes, specify.

Most disruptions occurred because of noise—either from bells ringing, musiclessons, the playground, or students changing classes.

No 792 (88.9%)

Missing 2 (0.2%)

Q 29 Generally, were the students orderly and cooperative?

Yes 836 (94.6%)

No 48 (5.4%)

Missing 9 (1.0%)

These data suggest that very high proportionsof students who participated in the PISA testsand questionnaires did so in a positive spirit.Students refused to participate after the sessionhad started in only a small percentage ofsessions. While cheating was reported to haveoccurred in 42 of the observed sessions, it isdifficult to imagine why students would try tocheat given that they knew that they haddifferent test booklets. This item was included inthe SQM observation sheet because there wererelatively high levels of cheating reported in thefield trial. It is unlikely that the cheating offencesreported by the SQMs would have had muchinfluence on country results and rankings,though they possibly affected the results of a fewindividual students.

More than 95 per cent of TAs were reportedas responding to students’ questionsappropriately during the test and questionnaireadministration.

In 28 cases, TAs were observed to admit latestudents after the testing had begun, which wasclearly stated in their manual as unacceptable.Though low in incidence, this represented aviolation of PISA procedures and may not havereceived sufficient emphasis in TA training.

Defective test booklets were detected in closeto 100 (about 11 per cent) of the SQM visits.These were replaced appropriately in half the

cases, but in the other half the correct procedurewas not followed. In five cases, where there were10 or more defective booklets, the TA may nothave had enough spare copies to distributeaccording to the specified procedure, thoughsubsequent investigation showed that in two ofthese cases the booklets were replaced correctly.In the majority of cases, only one or twobooklets were defective. (Items that were missingfrom test booklets or questionnaires were codedas not administered and therefore did not affectresults.)

Summary

The data from the SQMs support the view that,in nearly all of the schools that were visited,PISA test and questionnaire sessions wereconducted in an orderly fashion and conformedto PISA procedures. The students also seem tohave participated with good grace judging by thelow level of disruption and refusal. This givesconfidence that the PISA test data were collectedunder near-optimal conditions.

Interview with the PISA SchoolCo-ordinator

School Quality Monitors conducted an interviewwith the PISA School Co-ordinators to try and

Ap

pen

dic

es

275

Q 36 Did any students refuse to participate in the assessment after the session had begun?

Yes 48 (5.4%)

No 837 (94.6%)

Missing 8 (0.9%)

Q 32 Was there any evidence of students cheating during the assessment session?

Yes 42 (4.7%)

If yes, describe this evidence.

Most of the evidence to do with cheating concerned students exchanging booklets, askingeach other questions and showing each other their answers. Some students checked with theirneighbour to see if they had the same test booklets. Most incidences reported by SchoolQuality Monitors seem to have been relatively minor infringements.

No 847 (95.3%)

Missing 4 (0.4%)

establish problems experienced by schools inconducting PISA, including: • How well the PISA activities went at the

school;• How many schools refused to have a make-up

session;• If there were any problems co-ordinating the

schedule for the assessment;• If there were any problems preparing an

accurate list of age-eligible students;• If there were any problems in making

exclusions of sampled students;• If the students received practice on PISA-like

questions before the PISA assessment; and• How many students refused to participate in

the PISA tests prior to testing.SQMs were advised that they should treat the

interview as a fairly informal ‘conversation’,since it was judged that this would be less likelyto be seen as an imposition on teachers. Thequality of the data from these interviews wasgood, judging by their completeness and detail.

Responses from the interviews aresummarised in Table 90.

The PISA activities appeared to have gonewell in the vast majority of schools visited by theSQMs. The problems that did exist related toclashes with examinations or the curriculum, orlack of time to organise the testing. Given thedemands of the sampling procedures on manyschools, the relatively long time during whichstudents were needed, and the possibility that insome countries the Student Questionnaire couldbe seen as controversial, this result is a verypositive endorsement of the procedures put inplace by National Centres.

Only 10 of the almost 900 schools that SQMsvisited refused to conduct a make-up session ifone was required. Generally, the schools visitedhad no difficulties co-ordinating the schedule forthe assessment. This may suggest that the role ofthe School Co-ordinator was important inhelping to place PISA into the school, and that itsuccessfully achieved its objectives.

It was expected that preparing the list of age-eligible students might be a problem for schools,particularly those without computer-basedrecords. However, only a small minority of theschools visited by the SQMs said that they hadexperienced problems.

It had also been anticipated that excludingsampled students might be a problem forschools. The evidence here shows that only asmall minority of the schools visited by theSQMs had problems. This finding is consistentwith the NCQM reports where only five NPMsanticipated that schools would have problems.(However, it is also possible that exclusions wereseen not to be problematic because they werenot taken seriously. Caution is therefore advisedin interpreting these results.)

Very few schools had used questions similarto those found in PISA for practice before assess-ment day. This is a positive finding since it impliesthat rehearsal effects were probably minimal.

Student refusals prior to the testing day werewidely distributed among schools, with onenotable exception, as shown in Table 91. Furtherinvestigation showed that this school was in acountry where a national option meant that morethan 35 students could be sampled. All otherschools from this country had low levels of refusal.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

276

Table 90: Responses to the School Co-ordinator Interview

Interview question Yes (%) No (%) Missing (%)

In your opinion, did the PISA activities go well at your school? 91 5 4

Were there any problems co-ordinating the schedule for 11 85 4the assessment?

Did you have any problems preparing an accurate list of age- 7 90 3eligible students?

Did you have any problems making exclusions after you 5 90 5received the list of sampled students?

Did the students get any practice on questions like those in 4 91 5PISA before their assessment day?

Did any students refuse to take the test prior to the testing? 22 76 2

Ap

pen

dic

es

277

Staffing

Test AdministratorsThree-quarters of the NPMs thought that it waseasy to find TAs as described by the PISAguidelines. Of the remaining countries, twostated that they would have problems because ofa general lack of qualified people interested inthe job. Two countries reported that budgetconcerns made it difficult to find appropriatepeople. Five countries noted that it wasnecessary to use school staff, and that there was,therefore, a chance that the Test Administratormight be a teacher of some of the students.However, this was not expected to happen often.This is backed up by the responses to Question5, where no country expected to have to useteachers of sampled students routinely.

The National Centre or NPM in 25 countriesdirectly or indirectly trained the TAs. Twocountries contracted this training out. Sometimesthe National Centre would only train regionalstaff, who would then train the TAs workinglocally.

Table 91: Numbers of Students per SchoolRefusing to Participate in PISA Prior to Testing

Number of Students Per School Frequency

1 882 473 184 145 76 77 49 410 211 112 144 1

National Centre Quality MonitorReports

NCQMs interviewed 29 NPMs. This report doesnot include Australia, Japan or the United States,where the Consortium was responsible in wholeor in part for PISA activities. Seven NCQMscollected the data. Most interviews lasted abouttwo hours.

The International Project Centre has established the following criteria for Test Administrators:

It is required that the Test Administrator NOT be the reading, mathematics, or science instructor of any studentsin the sessions he or she will administer for PISA;

It is recommended that the Test Administrator NOT be a member of the staff of any school where he or she willadminister PISA; and

It is preferable that the Test Administrator NOT be a member of the staff of any school in the PISA sample.

Q 4 How easy was it to comply with these criteria? If NOT easy, why not?Easy 20Not Easy 8No Response/NA 1

Q 5 Who will be Test Administrators for the main study?NPM/National Centre staff 2Regional/District 4External contractor 7Teacher of students 0Other school staff 9Other 7

School Quality Monitors

Several countries found it difficult to recruitSQMs primarily because the NPM believed thatthe Consortium had not provided an adequatejob description. Three countries noted that itwas difficult to find interested people that spokeeither English or French, and a further threecountries noted that it was hard to find qualifiedpeople interested in or independent of thenational centre.

MarkersCountries planned to use several sources for recruit-ing markers. Most planned to use teachers or re-tired teachers who were familiar with 15-year-oldsand the subject matter taught to them. Thirteencountries intended to use university students(mainly graduate-level in education). Five countriesplanned to use national centre staff. A few countriesplanned on using scientists or research staff asmarkers.

Administrative Issues

Obtaining permission from schoolsThirteen NPMs indicated that it had beendifficult to obtain schools’ participation in PISAin their country. Four said that money or in-kindcontributions to the school were used toencourage participation. In one country speecheswere made to teacher and parent groups, and inanother a national advertisement to promotePISA was used. Several countries providedbrochures and other information to schools, andat least one country used a public website.Letters of support from Ministries of Educationwere noted to be useful in some instances.Telephone calls and personal meetings withschool and regional staff were also common.

Unnanounced visits by School QualityMonitors

PISA procedures for School Quality Monitors’visits to schools required that the visits beunannounced. This was expected to pose problemsin several countries, but in practice it was aproblem in only three, where unannounced visitsto schools are not permitted.

Security and confidentialityAll countries took the issue of confidentialityseriously. Nearly all required a pledge ofconfidentiality and provided instructions onmaintaining confidentiality and security. Of the

PIS

A 2

00

0 t

ech

nic

al r

epor

t

278

Q 14 How easy was it to find School QualityMonitors?

Easy 15 Not Easy 13 If NOT easy, why not?

Need better job description 6Lack English/French 3Qualified – no interest/time 2Qualified – not independentof national centre 1

The markers of PISA tests:

• must have a good understanding of mid-secondary level mathematics and science OR <test language>; and

• understand secondary level students and the ways that students at this level express themselves.

Q 49 Do you anticipate or did you have any problems recruiting the markers who will meet thesecriteria? If yes, what are these problems?

Yes 8 Seven of these ‘yes’ responses were due to a ‘lack of qualified/available staff’. Two countries cited cost or other factors as reasonsfor having difficulties recruiting markers.

No 19No response/NA 2

Q 50 Who will be the markers (that is, from what pool will the markers be recruited)?First Source Second

Third National Centre staff 0 5 0 Teachers/Professional Educators 25 0 2 University students 3 1 1 Scientists/researchers 1 1 2 Other 0 1 1 No response/NA 0 0 0

29 respondents, 16 intended to send schoolpackages of materials direct to TAs, nine wereplanning to send them to the relevant school,and four had arranged to send them to regionalministry or school inspectors’ premises forcollection by the TA.

Most countries said that the materials couldnot be inventoried without breaking the seal,either because they were wrapped in paper ormore commonly were sealed in boxes withinstructions not to open them until testing.These countries had not followed the suggestionin the NPM Manual that packages be heat-sealed in clear plastic, allowing booklets to becounted without the seal being broken.

Student Tracking FormsStudent Tracking Forms are among the mostimportant administrative tools in PISA. It wasimportant to know if NPMs, TAs and SCs foundthem amenable to use. Replies pointed to a needto simplify the coding scheme for documentingstudent exclusions and to give thought totranslation issues.

Translation and VerificationThe quality of data from a study like PISAdepends considerably on the quality of theinstruments, which in turn depend largely on thequality of translation. It was very important to theproject to know that National Centres were ableto follow the translation guidelines and proceduresappropriately. By and large the translation qualityobjectives were met, as shown by the NPMs’responses to the following questions.Eight countries listed time pressure as the mainproblem with international verification. Two

Ap

pen

dic

es

279

Q 12 What are you doing to keep the materialssecure?

At the National Centre All countries reported goodsecurity including confidentialitypledges, locked rooms/securedbuildings, and instructions aboutconfidentiality and security. All ofthe National Centres were usedto dealing with embargoed andsecure material.

While in the possession of the TestAdministrator

Most countries used staff whowere familiar with test securityand confidentiality. This issue wasalso emphasised at TestAdministrator training.

While in the possession of the SchoolCo-ordinator

In these cases, the material wassealed and the School Co-ordinator given instructions onhandling the material. Materialswere not to be opened until themorning of testing.

Q 17 Does the Student Tracking Form providedby the Consortium meet the needs inyour country?

Yes 22 No 7

If no, how did you modify the form to better meetyour needs?

Three countries indicated that they hadproblems entering the exclusions. Otherproblems mentioned were: problems trans-lating the text into the country’s lang-uage, problems adding a tenth bookletand problems adding additional columns.One country thought the process wastoo complicated in general.

Q 28 Were the guidelines (for translation andverification) clear?

Yes 24 No 1 No response/NA 4

Q 30 How much did the internationalverification add to the workload in themain study? If a lot, explain.

Not much 17 A lot 10 No response/NA 2

Q 33 Did the remarks of the verifiersignificantly improve the national versionfor the field trial? If no, what were theproblems?

Yes 14No 11No response/NA 4

PIS

A 2

00

0 t

ech

nic

al r

epor

t

280

commented on the cost of verification procedures.The 11 NPMs who said that there was nosignificant improvement in the national versionfor the field trial also noted that they hadexperienced few initial problems. No countryfelt that the verifiers had made errors, but atleast one mentioned that they did not alwaysagree with or accept the verifier’s suggestions.

Slightly fewer countries indicated that therewere significant improvements in their nationalversions for the main study. Again, the reason

expressed by ten of the countries whoresponded that significant improvements werenot made was that not many changes weremade in any case. This is what would be hopedafter the experiences of the field trial. However,a small number of countries did indicate someproblems.

Adequacy of the Manuals

One aim of the NCQM visits was to collectsuggestions from NPMs about areas in which thevarious PISA manuals could be improved.Seventeen of the NPMs offered comments, whichare tabulated in the box below.

The format of some of the manuals clearlyneeds to be simplified. A problem for many NPMswas keeping track of and following revisions,especially while translations were in progress.

Q 34 Did the remarks of the verifiersignificantly improve the nationalversion for the main study? If no, whatwere the problems?

Yes 12No 13No response/NA 4

Q 55 Do you have any comments about manuals and other materials at this time?

NPM Manual Thought manual good/adequate = 7Comments offered=17 Concerned about the lateness of receiving manual = 2No comments offered=12 Manual needs to be simplified or made clearer = 8

SC Manual

Comments offered=17 Thought manual good/adequate = 5No comments offered=12 Manual needs to be simplified or made clearer = 11

Manual needs checklist or more examples (especially non-English/French) = 1

TA Manual

Comments offered=16 Thought manual good/adequate=4No comments offered=13 Manual needs to be simplified or made clearer=11

Manual needs checklist or more examples (especially non-English/French) = 1

Marking Guide

Comments offered=16 Thought Guide good/adequate = 11No comments offered=13 Guide needs to be simplified or made clearer = 2

Guide needs checklist or more examples (especially non-English/French) = 1Other comments = 2

Sampling Manual

Comments offered=16 Thought manual good/adequate = 12No comments offered=13 Manual needs to be simplified or made clearer = 4

Translation Guidelines

Comments offered=11 Thought Guidelines good/adequate = 10No comments offered18 Guidelines need to be simplified or made clearer = 1

Other

Comments offered=6 Thought good/adequate = 2No comments offered=23 Manual needs to be simplified or made clearer = 4

(Other comments referred to the KeyQuest Manual and were noted earlierin the report in the section on the Student Tracking Form and sampling.)

Ap

pen

dic

es

281

Conclusion

In general, the NCQM reports point to a strongorganisational base for the conduct of PISA.National Centres were, in the main, wellorganised and had a good understanding of thefollowing aspects:• appointing and training Test Administrators;• appointing SQMs;• organising for marking and appointing markers;• security of PISA materials;• the Student Tracking Form and sampling

issues; and• translation (where appropriate) and verification

of materials, although concerns were expressedabout how much work was involved and somedisagreement with decisions made during thisprocess was indicated.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

282

Appendix 6: Order Effects Figures

Table 92: Item Assignment to Reading Clusters

Cluster

No. 1 2 3 4 5 6 7 8 9

1 R219Q01T R245Q01 R091Q05 R227Q01 R093Q03 R061Q01 R228Q01 R110Q01 R077Q02

2 R219Q01E R245Q02 R091Q06 R227Q02T R055Q01 R061Q03 R228Q02 R110Q04 R077Q03

3 R219Q02 R225Q02 R091Q07B R227Q03 R055Q02 R061Q04 R228Q04 R110Q05 R077Q04

4 R081Q01 R225Q03 R119Q09T R227Q04 R055Q03 R061Q05 R099Q04B R110Q06 R077Q05

5 R081Q05 R225Q04 R119Q01 R227Q06 R055Q05 R083Q01 R120Q01 R237Q01 R077Q06

6 R081Q06A R238Q01 R119Q07 R086Q05 R122Q02 R083Q02 R120Q03 R237Q03 R216Q01

7 R081Q06B R238Q02 R119Q06 R086Q07 R122Q03T R083Q03 R120Q06 R236Q01 R216Q02

8 R101Q01 R234Q01 R119Q08 R086Q04 R067Q01 R083Q04 R120Q07T R236Q02 R216Q03T

9 R101Q02 R234Q02 R119Q04 R102Q01 R067Q04 R083Q06 R220Q01 R246Q01 R216Q04

10 R101Q03 R241Q02 R119Q05 R102Q04A R067Q05 R100Q04 R220Q02B R246Q02 R216Q06

11 R101Q04 R102Q05 R076Q03 R100Q05 R220Q04 R239Q01 R088Q01

12 R101Q05 R102Q06 R076Q04 R100Q06 R220Q05 R239Q02 R088Q03

13 R101Q08 R102Q07 R076Q05 R100Q07 R220Q06 R088Q04T

14 R070Q02 R111Q01 R104Q01 R088Q05T

15 R070Q03 R111Q02B R104Q02 R088Q07

16 R070Q04 R111Q04 R104Q05 R040Q02

17 R070Q07T R111Q06B R104Q06 R040Q03A

18 R040Q03B

19 R040Q04

20 R040Q06

Table 92 shows the assignment of items to each ofthe nine reading clusters. Figures 84 to 92 representthe complete set of plots of order effects in thebooklets. There is one figure for each cluster. In thelegend, the first digit after ‘Pos.’ shows whether the

cluster was in the first or second half of thebooklet, and the second digit shows the cluster’sposition in the booklet half. For example, Pos. 2.1means that the cluster was the first one in thesecond half of the booklet indicated in brackets.

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Item

Par

amet

er D

iffe

renc

es


Item

Figure 84: ItemParameter Differencesfor the Items inCluster One

Ap

pen

dic

es

283

-0.4-0.3-0.2-0.1

00.10.20.30.4

1 2 3 4 5 6 7 8 9 10Item


Item

Par

amet

er D

iffe

renc

es

Figure 85: ItemParameter Differencesfor the Items inCluster Two

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10Item


Item

Par

amet

er D

iffe

renc

es

-0.4

-0.3-0.2

-0.1

0

0.10.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17Item


Item

Par

amet

er D

iffer

ence

s

Figure 86: ItemParameter Differencesfor the Items inCluster Three

Figure 87: ItemParameter Differencesfor the Items inCluster Four

284

Figure 88: ItemParameter Differencesfor the Items inCluster Five

Figure 89: ItemParameter Differencesfor the Items inCluster Six

Figure 90: ItemParameter Differencesfor the Items inCluster Seven

-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 6 7 8 9 10 11 12 13

Item


Item

Par

amet

er D

iffe

renc

es

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17Item


Item

Par

amet

er D

iffe

renc

es

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13Item


Item

Par

amet

er D

iffe

renc

es

PIS

A 2

00

0 t

ech

nic

al r

epor

t

Ap

pen

dic

es

285

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12Item


Item

Par

amet

er D

iffe

renc

e

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Item

Pos. 2.1 (book 9) Pos 2.2 (book 8)

Item

Par

amet

er D

iffe

renc

e

Figure 91: ItemParameter Differencesfor the Items inCluster Eight

Figure 92: ItemParameter Differencesfor the Items inCluster Nine

PISA Consortium

Australian Council for EducationalResearch

Ray Adams (Project Director of the PISA Consortium)

Alla Berezner (data processing, data analysis)

Claus Carstensen (data analysis)

Lynne Darkin (reading test development)

Brian Doig (mathematics test development)

Adrian Harvey-Beavis (quality monitoring,questionnaire development)

Kathryn Hill (reading test development)

John Lindsey (mathematics test development)

Jan Lokan (quality monitoring, field proceduresdevelopment)

Le Tu Luc (data processing)

Greg Macaskill (data processing)

Joy McQueen (reading test development and reporting)

Gary Marks (questionnaire development)

Juliette Mendelovits (reading test development andreporting)

Christian Monseur (Director of the PISA Consortiumfor data processing, data analysis, quality monitoring)

Gayl O’Connor (science test development)

Alla Routitsky (data processing)

Wolfram Schulz (data analysis)

Ross Turner (test analysis and reporting co-ordination)

Nikolai Volodin (data processing)

Craig Williams (data processing, data analysis)

Margaret Wu (Deputy Project Director of the PISAConsortium)

Westat

Nancy Caldwell (Director of the PISA Consortiumfor field operations and quality monitoring)

Ming Chen (sampling and weighting)

Fran Cohen (sampling and weighting)

Susan Fuss (sampling and weighting)

Brice Hart (sampling and weighting)

Sharon Hirabayashi (sampling and weighting)

Sheila Krawchuk (sampling and weighting)

Dward Moore (field operations and quality monitoring)

Phu Nguyen (sampling and weighting)

Monika Peters (field operations and quality monitoring)

Merl Robinson (field operations and quality monitoring)

Keith Rust (Director of the PISA Consortium forsampling and weighting)

Leslie Wallace (sampling and weighting)

Dianne Walsh (field operations and quality monitoring)

Trevor Williams (questionnaire development)

Citogroep

Steven Bakker (science test development)

Bart Bossers (reading test development)

Truus Decker (mathematics test development)

Erna van Hest (reading test development andquality monitoring)

Kees Lagerwaard (mathematics test development)

Gerben van Lent (mathematics test development)

Ico de Roo (science test development)

Maria van Toor (office support and qualitymonitoring)

Norman Verhelst (technical advice, data analysis)

Educational Testing Service

Irwin Kirsch (reading test development)

Other experts

Marc Demeuse (quality monitoring)

Harry Ganzeboom (questionnaire development)

Aletta Grisay (technical advice, data analysis,translation, questionnaire development)

Donald Hirsch (editorial review)

Erich Ramseier (questionnaire development)

Marie-Andrée Somers (data analysis and reporting)

Peter Sutton (editorial review)

Rich Tobin (questionnaire development and reporting)

J. Douglas Willms (questionnaire development,data analysis and reporting)

Reading FunctionalPIS

A 2

00

0 t

ech

nic

al r

epor

t

286

Appendix 7: PISA 2000 Expert Group Membership andConsortium Staff

Expert Group

Irwin Kirsch (Chair) (Educational Testing Service,United States)

Marilyn Binkley (National Center for EducationalStatistics, United States)

Alan Davies (University of Edinburgh, UnitedKingdom)

Stan Jones (Statistics Canada, Canada)

John de Jong (Language Testing Services, TheNetherlands)

Dominique Lafontaine (Université de Liège SartTilman, Belgium)

Pirjo Linnakylä (University of Jyväskylä, Finland)

Martine Rémond (Institut National de RecherchePédagogique, France)

Ryo Watanabe (National Institute for EducationalResearch, Japan)

Prof. Wolfgang Schneider (University of Würzburg,Germany)

Mathematics FunctionalExpert Group

Jan de Lange (Chair) (Utrecht University, TheNetherlands)

Raimondo Bolletta (Istituto Nazionale di Valutazione,Italy)

Sean Close (St Patrick’s College, Ireland)

Maria Luisa Moreno (IES “Lope de Vega”, Spain)

Mogens Niss (IMFUFA, Roskilde University,Denmark)

Kyungmee Park (Hongik University, Korea)

Thomas A. Romberg (United States)

Peter Schüller (Federal Ministry of Education andCultural Affairs, Austria)

Science Functional ExpertGroup

Wynne Harlen (Chair) (University of Bristol,United Kingdom)

Peter Fensham (Monash University, Australia)

Raul Gagliardi (University of Geneva, Switzerland)

Svein Lie (University of Oslo, Norway)

Manfred Prenzel (Universität Kiel, Germany)

Senta Raizen (National Center for ImprovingScience Education (NCISE), United States)

Donghee Shin (Dankook University, Korea)

Elizabeth Stage (University of California, United States)

Cultural Review Panel

Ronald Hambleton (Chair) (University ofMassachusetts, United States)

David Bartram (SHL Group plc, United Kingdom)

Sunhee Chae (KICE, Korea)

Chiara Croce (Ministero della Pubblica Istruzione,Direzione Generale degli Scambi Culturali, Italy)


Jose Muniz (University of Oviedo, Spain)

Yves Poortinga (Tilburg University, TheNetherlands)

Norbert Tanzer (University of Graz, Austria)

Pierre Vrignaud (INETOP - Institut National pourl’Enseignement Technique et l’OrientationProfessionnelle, France)

Ingemar Wedman (University of Stockholm,Sweden)

Technical Advisory Group

Geoff Masters (Chair) (ACER, Australia)

Ray Adams (ACER, Australia)

Pierre Foy (Statistics Canada, Canada)

Aletta Grisay (Belgium)

Larry Hedges (The University of Chicago, UnitedStates)

Eugene Johnson (American Institutes for Research,United States)


Keith Rust (WESTAT, USA)

Norman Verhelst (Cito group, The Netherlands)

J. Douglas Willms (University of New Brunswick,Canada)

Ap

pen

dic

es

287

PIS

A 2

00

0 t

ech

nic

al r

epor

t

288

Appendix 8: Contrast Coding for PISA-2000Conditioning Variables

Conditioning Variables Variable Name (s) Variable Coding Contrast Coding

Sex - Q3 ST03Q01 1 <Female> -12 <Male> 1Missing 0

Mother - Q4a ST04Q01 1 Yes 10 2 No 01 Missing 00

Female Guardian - Q4b ST04Q02 1 Yes 10 2 No 01 Missing 00

Father - Q4c ST04Q03 1 Yes 10 2 No 01 Missing 00

Male Guardian - Q4d ST04Q04 1 Yes 10 2 No 01 Missing 00

Brothers - Q4e ST04Q05 1 Yes 10 2 No 01 Missing 00

Sisters - Q4f ST04Q06 1 Yes 10 2 No 01 Missing 00

Grandparents - Q4g ST04Q07 1 Yes 10 2 No 01 Missing 00

Others - Q4h ST04Q08 1 Yes 10 2 No 01 Missing 00

Older - Q5a ST05Q01 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11

Younger - Q5b ST05Q02 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11

Same age - Q5c ST05Q03 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11

Ap

pen

dic

es

289


Mother currently doing - Q6 ST06Q01 1 Working full-time <for pay> 10002 Working part-time <for pay> 01003 Not working, but looking... 00104 Other (e.g., home duties, retired) 0001Missing 0000

Father currently doing - Q7 ST07Q01 1 Working full-time <for pay> 10002 Working part-time <for pay> 01003 Not working, but looking... 00104 Other (e.g., home duties, retired) 0001Missing 0000

Mother’s secondary educ ST12Q01 1 No, she did not go to school 10000 - Q12 2 No, she completed <ISCED

level 1> only 01000 3 No, she completed <ISCED level 2> only 00100 4 No, she completed <ISCED level 3B, 3C> only 00010 5 Yes, she completed <ISCED 3A> 00001 Missing 00000

Father’s secondary educ ST13Q01 1 No, he did not go to school 10000 - Q13 2 No, he completed <ISCED 1>

1> only 01000 3 No, he completed <ISCED 2> only 00100 4 No, he completed <ISCED 3B, 3C> only 00010 5 Yes, he completed <ISCED 3A> 00001 Missing 00000

Mother’s tertiary educ - Q14 ST14Q01 1 Yes 10 2 No 01 Missing 00

Father’s tertiary educ - Q15 ST15Q01 1 Yes 10 2 No 01 Missing 00

Country of birth, self - Q16a ST16Q01 1 <Country of Test> 10 2 Another Country 01 Missing 00

Country of birth, Mother ST16Q02 1 <Country of Test> 10 - Q16b 2 Another Country 01

Missing 00 Country of birth, Father ST16Q03 1 <Country of Test> 10 - Q16c 2 Another Country 01

Missing 00


Language at home - Q17 ST17Q01 1 <Test language> 10002 <Other official national

languages> 01003 <Other national dialects or

languages> 00104 <Other languages> 0001Missing 0000

Movies - Q18a ST18Q01 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 41

Art gallery - Q18b ST18Q02 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11

Pop Music - Q18c ST18Q03 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11

Opera - Q18d ST18Q04 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11

Theatre - Q18e ST18Q05 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11

Sport - Q18f ST18Q06 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 41

Discuss politics - Q19a ST19Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

PIS

A 2

00

0 t

ech

nic

al r

epor

t

290


Discuss books - Q19b ST19Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

Listen classics - Q19c ST19Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Discuss school issues ST19Q04 1 Never or hardly ever 10 - Q19d 2 A few times a year 20

3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

Eat <main meal> - Q19e ST19Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

Just talking - Q19f ST19Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

Mother - Q20a ST20Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Father - Q20b ST20Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Ap

pen

dic

es

291

PIS

A 2

00

0 t

ech

nic

al r

epor

t

292


Siblings - Q20c ST20Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Grandparents - Q20d ST20Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Other relations - Q20e ST20Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Parents’ friends - Q20f ST20Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Dishwasher - Q21a ST21Q01 1 Yes 10 2 No 01 Missing 00

Own room - Q21b ST21Q02 1 Yes 10 2 No 01 Missing 00

Educat software - Q21c ST21Q03 1 Yes 10 2 No 01 Missing 00

Internet - Q21d ST21Q04 1 Yes 10 2 No 01 Missing 00

Dictionary - Q21e ST21Q05 1 Yes 10 2 No 01 Missing 00

Study place - Q21f ST21Q06 1 Yes 10 2 No 01 Missing 00

Desk - Q21g ST21Q07 1 Yes 10 2 No 01 Missing 00

Ap

pen

dic

es

293


Text books - Q21h ST21Q08 1 Yes 10 2 No 01 Missing 00

Classic literature - Q21i ST21Q09 1 Yes 10 2 No 01 Missing 00

Poetry - Q21j ST21Q10 1 Yes 10 2 No 01 Missing 00

Art works - Q21k ST21Q11 1 Yes 10 2 No 01 Missing 00

Phone - Q22a ST22Q01 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21

Television - Q22b ST22Q02 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 41

Calculator - Q22c ST22Q03 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 41

Computer - Q22d ST22Q04 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21

Musical instruments - Q22e ST22Q05 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 11

Car - Q22f ST22Q06 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21

Bathroom - Q22g ST22Q07 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21

PIS

A 2

00

0 t

ech

nic

al r

epor

t

294


<Extension> - Q23a ST23Q01 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

<Remedial> in <test lang> ST23Q02 1 No, never 10 - Q23b 2 Yes, sometimes 20

3 Yes, regularly 30 Missing 11

<Remedial> in other subjects ST23Q03 1 No, never 10 - Q23c 2 Yes, sometimes 20


Skills training - Q23d ST23Q04 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

In <test language> - Q24a ST24Q01 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

In other subjects - Q24b ST24Q02 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

<Extension> - Q24c ST24Q03 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

<Remedial> in<test language> ST24Q04 1 No, never 10 - Q24d 2 Yes, sometimes 20


<Remedial> in other subjects ST24Q05 1 No, never 10 - Q24e 2 Yes, sometimes 20


Skills training - Q24f ST24Q06 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

<Private tutoring> - Q24g ST24Q07 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11

Ap

pen

dic

es

295


School program - Q25 ST25Q01 1 <ISCED 2A> 1000002 <ISCED 2B> 0100003 <ISCED 2C> 0010004 <ISCED 3A> 0001005 <ISCED 3B> 0000106 <ISCED 3C> 000001Missing 000000

Teachers wait long time ST26Q01 1 Never 10 - Q26a 2 Some lessons 20

3 Most lessons 30 4 Every lesson 40 Missing 21

Teachers want students to ST26Q02 1 Never 10 work - Q26b 2 Some lessons 20


Teachers tell students do ST26Q03 1 Never 10 better - Q26c 2 Some lessons 20


Teachers don’t like - Q26d ST26Q04 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21

Teachers show interest ST26Q05 1 Never 10 - Q26e 2 Some lessons 20


Teachers give opportunity ST26Q06 1 Never 10 - Q26f 2 Some lessons 20


Teachers help with work ST26Q07 1 Never 10 - Q26g 2 Some lessons 20


Teachers continue teaching ST26Q08 1 Never 10 - Q26h 2 Some lessons 20


PIS

A 2

00

0 t

ech

nic

al r

epor

t

296


Teachers do a lot to ST26Q09 1 Never 10 help - Q26i 2 Some lessons 20


Teachers help with learning ST26Q10 1 Never 10 - Q26j 2 Some lessons 20


Teachers check homework ST26Q11 1 Never 10 - Q26k 2 Some lessons 20


Students cannot work well ST26Q12 1 Never 10 - Q26l 2 Some lessons 20


Students don’t listen - Q26m ST26Q13 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21

Students don’t start - Q26n ST26Q14 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21

Students learn a lot - Q26o ST26Q15 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 31

Noise & disorder - Q26p ST26Q16 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21

Doing nothing - Q26q ST26Q17 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21

Ap

pen

dic

es

297


Number in < test language> ST27Q01 1 010- Q27a 2 020

… …90 900Missing 041

Usual in <test lang> ST27Q02 1 Yes 10- Q27aa 2 No 01

Missing 00Number in Mathematics ST27Q03 1 010

- Q27b 2 020… …90 900Missing 041

Usual in Mathematics ST27Q04 1 Yes 10- Q27bb 2 No 01

Missing 00Number in Science - Q27c ST27Q05 1 010

2 020… …90 900Missing 041

Usual in Science - Q27cc ST27Q06 1 Yes 102 No 01Missing 00

In < test language> - Q28a ST28Q01 1 0102 020… …90 900Missing 251

In Mathematics - Q28b ST28Q02 1 0102 020… …90 900Missing 241

In Science - Q28c ST28Q03 1 0102 020… …90 900Missing 241

Miss school - Q29a ST29Q01 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11

PIS

A 2

00

0 t

ech

nic

al r

epor

t

298


<Skip> classes - Q29b ST29Q02 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11

Late for school - Q29c ST29Q03 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11

Well with teachers - Q30a ST30Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Interested in students - Q30b ST30Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Listen to me - Q30c ST30Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Give extra help - Q30d ST30Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31C

Treat me fairly - Q30e ST30Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Feel an outsider - Q31a ST31Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11

Make friends - Q31b ST31Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Ap

pen

dic

es

299


Feel I belong - Q31c ST31Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Feel awkward - Q31d ST31Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11

Seem to like me - Q31e ST31Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Feel lonely - Q31f ST31Q06 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11

Don’t want to be - Q31g ST31Q07 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Feel Bored - Q31h ST31Q08 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

I complete on time - Q32a ST32Q01 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 31

I do watching TV - Q32b ST32Q02 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21

Teachers grade - Q32c ST32Q03 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21

PIS

A 2

00

0 t

ech

nic

al r

epor

t

300


I finish at school - Q32d ST32Q04 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21

Teachers comment on ST32Q05 1 Never 10 - Q32e 2 Sometimes 20

3 Most of the time 30 4 Always 40 Missing 21

Is interesting - Q32f ST32Q06 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21

Is counted in <mark> ST32Q07 1 Never 10 - Q32g 2 Sometimes 20

3 Most of the time 30 4 Always 40 Missing 21

Homework <test language> ST33Q01 1 No time 10 - Q33a 2 Less than 1 hour a week 20

3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31

Homework <maths> - Q33b ST33Q02 1 No time 10 2 Less than 1 hour a week 20 3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31

Homework <science> - Q33c ST33Q03 1 No time 10 2 Less than 1 hour a week 20 3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31

Read each day - Q34 ST34Q01 1 I do not read for enjoyment 10 2 30 minutes or less each day 20 3 More than 30 minutes to less 30

than 60 minutes each day 4 1 to 2 hours each day 40 5 More than 2 hours each day 50 Missing 21

Only if I have to - Q35a ST35Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Ap

pen

dic

es

301


Favourite hobby - Q35b ST35Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Talking about books - Q35c ST35Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Hard to finish - Q35d ST35Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Feel happy - Q35e ST35Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

Waste of time - Q35f ST35Q06 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Enjoy library - Q35g ST35Q07 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31

For information - Q35h ST35Q08 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Few minutes only - Q35i ST35Q09 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21

Magazines - Q36a ST36Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

PIS

A 2

00

0 t

ech

nic

al r

epor

t

302


Comics - Q36b ST36Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

Fiction - Q36c ST36Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 21

Non-fiction - Q36d ST36Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11

E-mail & Web - Q36e ST36Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 31

Newspapers - Q36f ST36Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51

How many books at home - Q37 ST37Q01 1 None 10 2 1-10 books 20 3 11-50 books 30 4 51-100 books 40 5 101-250 books 50 6 251-500 books 60 9 More than 500 books 70Missing 51

Borrow books - Q38 ST38Q01 1 Never or hardly ever 10 2 A few times per year 20 3 About once a month 30 4 Several times a month 40 Missing 11

Ap

pen

dic

es

303


How often use school ST39Q01 1 Never or hardly ever 10 library - Q39a 2 A few times a year 20


How often use computers ST39Q02 1 Never or hardly ever 10 - Q39b 2 A few times a year 20


How often use calculators ST39Q03 1 Never or hardly ever 10 - Q39c 2 A few times a year 20


How often use Internet ST39Q04 1 Never or hardly ever 10 - Q39d 2 A few times a year 20


How often use science labs ST39Q05 1 Never or hardly ever 10 - Q39e 2 A few times a year 20


Calculator Use Caluse 1 ‘No calculator’ 102 ‘A simple calculator’ 203 ‘A scientific calculator’ 304 ‘A programmable calculator’ 405 ‘A graphics calculator’ 50Missing 11

Booklet BookID 0 0000000001 1000000002 0100000003 0010000004 0001000005 0000100006 0000010007 0000001008 0000000109 000000001

PIS

A 2

00

0 t

ech

nic

al r

epor

t

304


Mother’s main job BMMJ 1 0102 020…90 900missing 500

Father’s main job BFMJ 1 0102 02090 900missing 501

Multi-column entries without over-bars indicate multiple contrasts. Barred columns are treated as one contrast.

Appendix 9: Sampling Forms

PISA Sampling Form 1 Time of Testing and Age Definition

See Section 3.1 of School Sampling Manual.

PISA Participant:

National Project Manager:

1. Beginning and ending dates of assessment 2000 to 2000

2. Please confirm that the assessment start date is after the first three months of the academic year.

Yes

No

3. Students who will be assessed were born between and D D M M Y Y D D M M Y Y

4. As part of of the PISA sampling process for your country, will you be selecting students other thanthose born between the dates in 3) above? (For example students in a particular grade, no matter whatage.)

No

Yes (please describe the additional population):

Ap

pen

dic

es

305

PISA Sampling Form 2 National Desired Target Population


PISA Participant:


1, Total national population of 15-year-olds: [a]

2. Total national population of 15-year-olds enrolled in educational institutions: [b]

3. Describe the population(s) to be omitted from the national desired target population (if applicable).

4. Total enrolment omitted from the national desired target population [c](corresponding to the omissions listed in the previous item):

5. Total enrolment in the national desired target population: [d]box [b] - box [c]

6. Percentage of coverage in the national desired target population: [e](box [d] / box [b]) x 100

7. Describe your data source (Provide copies of relevant tables):

PIS

A 2

00

0 t

ech

nic

al r

epor

t

306

PISA Sampling Form 3 National Defined Target Population


PISA Participant:


1. Total enrolment in the national desired target population: [a]From box [d] on Sampling Form 2

2. School-level exclusions:

Description of exclusions Number of schools Number of students

TOTAL [b]

Percentage of students not covered due to school-level exclusions: %(box [b] / box [a]) x 100

3. Total enrolment in national defined target population before within-school [c]exclusions: box [a] - box [b]

4. Within-school exclusions:

Description of exclusions Expectednumber of students

TOTAL [d]

Expected percentage of students not covered due to within-school exclusions: %(box [d] / box [a]) x 100

5. Total enrolment in national defined target population: [e](box [a] – (box [b] + box [d])

6. Coverage of national desired target population: [f](box [e] / box [a]) x 100

7. Describe your data source (Provide copies of relevant tables):

Ap

pen

dic

es

307

PISA Sampling Form 4 Sampling Frame Description

See Sections 5.2 – 5.4 of School Sampling Manual.

PISA Participant:


1. Will a sampling frame of geographic areas be used?

Yes Go to 2

No Go to 5

2. Specify the PSU Measure of Size to be used.

15-year-old student enrolment

Total student enrolment

Number of schools

Population size

Other (please describe):

3. Specify the school year for which enrolment data will be used for the PSU Measure of Size:

4. Please provide a preliminary description of the information available to construct the area frame,Please consult with Westat for support and advice in the construction and use of an area-levelsampling frame.

5. Specify the school estimate of enrolment (ENR) of 15-year-olds that will be used.

15-year-old student enrolment

Applying known proportions of 15-year-olds to corresponding grade level enrolments

Grade enrolment of the modal grade for 15-year-olds

Total student enrolment, divided by number of grades

6. Specify the year for which enrolment data will be used for school ENR.

7. Please describe any other type of frame, if any, that will be used.

PIS

A 2

00

0 t

ech

nic

al r

epor

t

308

PISA Sampling Form 5 Excluded Schools


PISA Participant:


Use additional sheets if necessary

School ID Reason for exclusion School ENR

Ap

pen

dic

es

309

PISA Sampling Form 6 Treatment of Small Schools


PISA Participant:


1. Enrolment in small schools:

Type of school based on enrolment Number of Number of Percentage of schools students Total enrolment

Enrolment of 15-year-old students < 17 [a]

Enrolment of 15-year-old students ≥ 17 and < 35 [b]

Enrolment of 15-year-old students ≥ 35 [c]

TOTAL 100%

2. If the sum of the percentages in boxes [a] and [b] is less than 5%, AND the percentage in box [a] is less than 1%,then all schools should be subject to normal school sampling, with no explicit stratification of small schoolsrequired or recommended.

(box [a] + box [b]) < 5% and box [a] < 1%? Yes or No⇓Go to 6.

3. If the sum of the percentages in boxes [a] and [b] is less than 5%, BUT the percentage in box [a] is 1% or more, astratum for very small schools is needed. Please see section 5.7.2 to determine an appropriate school sampleallocation for this stratum of very small schools.

(box [a] + box [b]) < 5% and box [a] ≥ 1%? Yes or No⇓Form an explicit stratum of very small schools and record this on Sampling Form 7.

4. If the sum of the percentages in boxes [a] and [b] is 5% or more, BUT the percentage in box [a] is less than 1%,an explicit stratum of small schools is required, but no special stratum for very small schools is required. Pleasesee Section 5.7.2 to determine an appropriate school sample allocation for this stratum of small schools.

(box [a] + box [b]) ≥ 5% and box [a] < 1%? Yes or No⇓Form an explicit stratum of small schools and record this on Sampling Form 7. Go to 6.

5. If the sum of the percentages in boxes [a] and [b] is 5% or more, AND the percentage in box [a] is 1% or more, anexplicit stratum of small schools is required, AND an explicit stratum for very small schools is required. Please seesection 5.7.2 to determine an appropriate school sample allocation for these strata of moderately small and verysmall schools.

(box [a] + box [b]) ≥ 5% and box [a] ≥ 1%? Yes or No⇓Form an explicit stratum of moderately small schoolsand an explicit stratum of very small schools, and recordthis on Sampling Form 7.

6. If the percentage in box [a] is less than 0.5%, then these very small schools can be excluded from the nationaldefined target population only if the total extent of school level exclusions of the type mentioned in 3.3.2 remainsbelow 0.5%. If these schools are excluded, be sure to record this exclusion on Sampling Form 3, item 2.

box [a] < 0.5%? Yes or No

⇓ Excluding very small schools? Yes or No

PIS

A 2

00

0 t

ech

nic

al r

epor

t

310

PISA Sampling Form 7 Stratification


PISA Participant:


Explicit Stratification

1. List and describe the variables used for explicit stratification.

Explicit stratification variables Number of levels

1

2

3

4

5

2. Total number of explicit strata:

(Note: if the number of explicit strata exceeds 99, the PISA school coding scheme will not work correctly.See Section 6.6. Consult Westat and ACER.)

Implicit Stratification

3. List and describe the variables used for implicit stratification in the order in which they will be used(i.e., sorting of schools within explicit strata).

Implicit stratification variables Number of levels

1

2

3

4

5

Ap

pen

dic

es

311

PISA Sampling Form 8 Population Counts by Strata


PISA Participant:



(1) (2) (3) (4) (5)

Population Counts

Schools Students

Explicit Strata Implicit Strata ENR MOS

PIS

A 2

00

0 t

ech

nic

al r

epor

t

312

PISA Sampling Form 9 Sample Allocation by Explicit Strata


PISA Participant:



(1) (2) (3) (4) (5) (6) (7) Sample Allocation

% of % of eligible

eligible No. of studentsStratum students in schools in expected

Explicit Strata Identification population population Schools Students in sample

Ap

pen

dic

es

313

PISA Sampling Form 10 School Sample Selection


PISA Participant:


Explicit Stratum: Stratum ID

S D I RN[a] Total Measure of Size [b] Desired Sample Size [c] Sampling Interval [d] Random Number


(1) (2)Line Numbers Selection Numbers

PIS

A 2

00

0 t

ech

nic

al r

epor

t

314

PISA Sampling Form 11 School Sampling Frame

See Sections 6.4 - 6.6 of School Sampling Manual.

PISA Participant:


Explicit Stratum: Stratum ID


(1) (2) (3) (4) (5) (6) School List Cumulative PISA School

Identification Implicit Stratum MOS MOS Sample Status Identification

Ap

pen

dic

es

315

Appendix 10: Student Listing Form

Page __ of __

School Identification: Country Name:

School Name: List Prepared By:

Address: Telephone Number:

Date List Prepared:

Total Number Students Listed:

DIRECTIONS: PLEASE COMPLETE COLUMNS A, B, C AND D, FOR EVERY STUDENT <BORN IN 1984>.

Include students who may be excluded from other testing programs, such as some students withdisabilities or limited language proficiency. Detailed instructions and information about providingcomputer-generated lists are on the other side of this page.

For Sampling Only (A) (B) (C) (D)Selected Student’s Name Sex BirthStudent Date

(Enter “S”) Line # (First Middle Initial Last) Grade (M / F) (mm/yy)

PIS

A 2

00

0 t

ech

nic

al r

epor

t

316

A. Instructions for Preparing a List of Eligible Students

1. Please prepare a list of ALL students <born in 1984. . .NPM must insert eligibility criteria> using the most currentenrolment records available.

2. Include on the list students who typically may be excluded from other testing programs (such as some students withdisabilities or limited language proficiency).

3. Write the name for each eligible student. Please also specify current grade, sex, and birth date for each student.

4. If confidentiality is a concern in listing student names, then a unique student identifier may be substituted. Becausesome students may have the same or similar names, it is important to include a birth date for each student.

5. The list may be computer-generated or prepared manually using the PISA Student Listing Form. A Student ListingForm is on the reverse side of these instructions. You may copy this form or request copies from your NationalProject Manager.

6. If you use the Student Listing Form on the reverse side of this page, do not write in the “For Sampling Only”columns.

7. Send the list to the National Project Manager (NPM) to arrive no later than <insert DATE>. Please address to theNPM as follows: <insert name and mailing address>

B. Suggestions for Preparing Computer-Generated Lists

1. Write the school name and address on list.

2. List students in alphabetical order.

3. Number the students.

4. Double-space the list.

5. Allow left-hand margin of at least two inches.

6. Include the date the printout was prepared.

7. Define any special codes used.

8. Include preparer’s name and telephone number.

Ap

pen

dic

es

317

PIS

A 2

00

0 t

ech

nic

al r

epor

t

318

Ap

pe

nd

ix 1

1:S

tu

de

nt

Tr

ac

kin

g F

or

m

Page __ of __

Cou

ntry

Nam

e:__

____

____

____

____

____

____

____

____

____

____

Stra

tum

Num

ber:

___

____

____

____

_Sc

hool

Nam

e:__

____

____

____

____

____

____

____

____

____

____

Scho

ol I

D:

____

____

____

____

____

__

SAM

PL

ING

IN

FOR

MA

TIO

N

(A)

(B)

(C)

(D)

(E)

(F)

Num

ber

of S

tude

nts

Num

ber

of S

tude

nts

Lis

ted

Sam

ple

Size

Ran

dom

Sa

mpl

ing

Firs

t L

ine

Num

ber

Age

d 15

for

Sam

plin

gN

umbe

rIn

terv

alSe

lect

ed[(

Box

D X

Box

E)

+ 1

____

____

__

____

__

____

____

0.

____

___

____

____

__

____

__

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

Part

icip

atio

n St

atus

Lin

e (9

)(1

0)N

umbe

r G

ende

rB

irth

Dat

e E

xclu

ded

Boo

klet

Ori

gina

l Ses

sion

Follo

w-u

p Se

ssio

n

ID(S

ampl

e)St

uden

t N

ame

Gra

de(M

/F)

(MM

-YY

)C

ode

Num

ber

P1

P2

SQ

P1

P2

SQ

1 2 3 4 5 6 7 8 9 10 11 12 13

14

15

EX

CL

USI

ON

CO

DE

S (C

ol. 7

)PA

RT

ICIP

AT

ION

CO

DE

S (C

ols.

9&

10)

1 =

Func

tion

ally

dis

able

d0

= A

bsen

t2

= E

duca

ble

men

tally

ret

arde

d1

= Pr

esen

t3

= L

imit

ed t

est

lang

uage

pro

fici

ency

8 =

Not

app

licab

le, i

.e.,

excl

uded

, no

long

er in

sch

ool,

not

age

elig

ible

4 =

Oth

er

(con

tinu

ed)

PISA

ST

UD

EN

T T

RA

CK

ING

FO

RM

Page

__

of _

_C

ount

ry N

ame:

____

____

____

____

____

____

____

____

____

____

__St

ratu

m N

umbe

r: _

____

____

____

___

Scho

ol N

ame:

____

____

____

____

____

____

____

____

____

____

__Sc

hool

ID

: __

____

____

____

____

____

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

Part

icip

atio

n St

atus

Lin

e (9

)(1

0)N

umbe

rG

ende

rB

irth

Dat

e E

xclu

ded

Boo

klet

Ori

gina

l Ses

sion

Follo

w-u

p Se

ssio

n

ID(S

ampl

e)St

uden

t N

ame

Gra

de(M

/F)

(MM

-YY

)C

ode

Num

ber

P1

P2

SQ

P1

P2

SQ

16

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Ap

pen

dic

es

319

Appendix 12: Adjustment to BRR for Strata with oddnumbers of Primary Sampling Units

Splitting an odd-sized sample into unequal half-samples without adjustment results in positively biasedestimates of variance with BRR. The bias may not be serious if the number of units is large and if thesample has a single-stage design. If, however, the sample has a multi-stage design and the number ofsecond-stage units per first-stage unit is even moderately large, the bias may be severe. The followingadjustment, developed by David Judkins of Westat, removes the bias.

The units in each stratum are divided into half-samples. Without loss of generality, assume that thefirst half-sample is the smaller of the two if the number of sample units in a stratum is odd. For PISA,

or 3. Let =least integer greater than or equal to ( = 1 or 2 for PISA), =greatest integerless than or equal to ( = 1 for PISA), 0≤k<1 be the Fay factor (k=0.5 for PISA) and

according to a rule such that for (as is obtained from a Hadamard matrix of

appropriate size).

Then the replication factors for the first and second half-samples are:

. (89)

Hence for PISA, if , the replication factors are and , that is 1.5 and 0.5, or

0.5 and 1.5, for the first and second half-samples respectively. If , the replication factors are:

and , (90)

that is, 1.7071 and 0.6464, or 0.2929 and 1.3536, where the first factor in each case is the one that appliesto the single PSU in the first half-sample, while the second factor applies to the two PSUs in the second half-sample.

ProofLet be an unbiased estimate of the characteristic of interest for the i-th unit in the h-th stratum. Let

be the full sample estimate. Let and denote membership of half-samples 1 and 2

respectively. Let L be the number of strata and T be the number of replicates (T=80 for PISA). Let

.

Then

. (91)

vk T

d k x dk

x x

k Td k

ht h hi hi hth

hi hii

n

i

n

h

L

t

T

ht h hi

hh

=−( )

+ −( )( ) + −−( )

−

=−( )

−( )

====∑∑∑∑1

1

11 1 1

1

1

1

11

2 1 2111

2

1

2

α δα

δ

α δ 11 2111

2

1

1 2111

11

1 1

x d k x

Td x x

hi hth

hi hii

n

i

n

h

L

t

T

ht h hi hih

hi hii

n

i

n

h

L

hh

hh

− −( )

= −

====

===

∑∑∑∑

∑∑∑

αδ

α δα

δ

22

1t

T

=∑

α h

h

h

u=

l

δhi2δhi1x xh i

hi= ∑ ∑xhi

1 2 2−( )dht1 2+ ( )dht

nh = 3

1 2− dht1 2+ dhtnh = 2

1 1 1 1+ −( ) − −( )d k

ud k

uhth

hthl

l

h

h and

h h≠ 'd dht h t' =∑ 0

dht = ±1 lhnh 2 lhuhnh 2uhnh = 2

nh

PIS

A 2

00

0 t

ech

nic

al r

epor

t

320

Given the balancing property of , this simplifies to:

. (92)

For convenience, rewrite the simplified as:

. (93)

To prove unbiasedness, note that , where (assuming identicaldistribution within stratum), and that:

. (94)

Note furthermore that under simple random sampling with replacement,

(95)

Thus,

(96)

which is the true sampling variance of x.

E v E x x Var x Var x

u

n

h h h hh

L

h

L

h h hh

L

h hh

L

ˆ ˆ ˆ ˆ ˆ

,

( ) = −( ) = ( ) + ( )[ ]

= +( )

=

==

=

=

∑∑

∑

∑

1 22

1 211

2

1

2

1

l σ

σ

Var x Var x

u

Var x u

h h h h h hi

h h

hh

h h

h

ˆ

ˆ

.

12 2 2

2

2 22

2

1

( ) = = ( )=

( ) =

=

α σ σ

σ

ασ

σ

l

l

, where

and

h

E x u X X u E xh

hh h h h h hˆ ˆ2 1

1( ) = = = ( )α

l

X E xh hi= ( ) E x X X uh h h h h h hˆ 1( ) = =α l l

ˆ ˆ ˆv x xh hh

L

= −( )=

∑ 1 22

1

v

vT

d x x

x x

hth

L

h hi hii

n

hhi hi

i

n

t

T

h hi hih

hi hii

n

i

n

h

L

h h

hh

= −

= −

= = ==

===

∑ ∑ ∑∑

∑∑∑

1 1

1

2

11

12

1

2

1

1 211

2

1

α δα

δ

α δα

δ

dht

Ap

pen

dic

es

321

pisa 2000 technical report - oecd.org · pisa 2000 technical report edited byray adams and margaret...

Documents