PISA 2000 Technical Report
Edited by Ray Adams and Margaret Wu
OECD
ORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT
Pursuant to Article 1 of the Convention signed in Paris on 14th December 1960, and which came into force on 30th September 1961, the Organisation for Economic Co-operation and Development (OECD) shall promote policies designed:
– to achieve the highest sustainable economic growth and employment and a rising standard of living in Member countries, while maintaining financial stability, and thus to contribute to the development of the world economy;
– to contribute to sound economic expansion in Member as well as non-member countries in the process of economic development; and
– to contribute to the expansion of world trade on a multilateral, non-discriminatory basis in accordance with international obligations.
The original Member countries of the OECD are Austria, Belgium, Canada, Denmark, France, Germany, Greece, Iceland, Ireland, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, Turkey, the United Kingdom and the United States. The following countries became Members subsequently through accession at the dates indicated hereafter: Japan (28th April 1964), Finland (28th January 1969), Australia (7th June 1971), New Zealand (29th May 1973), Mexico (18th May 1994), the Czech Republic (21st December 1995), Hungary (7th May 1996), Poland (22nd November 1996), Korea (12th December 1996) and the Slovak Republic (14th December 2000). The Commission of the European Communities takes part in the work of the OECD (Article 13 of the OECD Convention).
© OECD 2002
Permission to reproduce a portion of this work for non-commercial purposes or classroom use should be obtained through the Centre français d’exploitation du droit de copie (CFC), 20, rue des Grands-Augustins, 75006 Paris, France, tel. (33-1) 44 07 47 70, fax (33-1) 46 34 67 19, for every country except the United States. In the United States permission should be obtained through the Copyright Clearance Center, Customer Service, (508) 750-8400, 222 Rosewood Drive, Danvers, MA 01923 USA, or CCC Online: www.copyright.com. All other applications for permission to reproduce or translate all or part of this book should be made to OECD Publications, 2, rue André-Pascal, 75775 Paris Cedex 16, France.
FOREWORD
Are students well prepared to meet the challenges of the future? Are they able to analyse, reason and communicate their ideas effectively? Do they have the capacity to continue learning throughout life? Parents, students, the public and those who run education systems need to know the answers to these questions.
Many education systems monitor student learning in order to provide some answers to these questions. Comparative international analyses can extend and enrich the national picture by providing a larger context within which to interpret national results. They can show countries their areas of relative strength and weakness and help them to monitor progress and raise aspirations. They can also provide directions for national policy, for schools’ curriculum and instructional efforts and for students’ learning. Coupled with appropriate incentives, they can motivate students to learn better, teachers to teach better, and schools to be more effective.
In response to the need for internationally comparable evidence on student performance, the OECD launched the Programme for International Student Assessment (PISA). PISA represents a new commitment by the governments of OECD countries to monitor the outcomes of education systems in terms of student achievement on a regular basis and within a common framework that is internationally agreed upon. PISA aims at providing a new basis for policy dialogue and for collaboration in defining and operationalising educational goals – in innovative ways that reflect judgements about the skills that are relevant to adult life. It provides inputs for standard-setting and evaluation; insights into the factors that contribute to the development of competencies and into how these factors operate in different countries; and it should lead to a better understanding of the causes and consequences of observed skill shortages. By supporting a shift in policy focus from educational inputs to learning outcomes, PISA can assist countries in seeking to bring about improvements in schooling and better preparation for young people as they enter an adult life of rapid change and deepening global interdependence.
PISA is a collaborative effort, bringing together scientific expertise from the participating countries, steered jointly by their governments on the basis of shared, policy-driven interests. Participating countries take responsibility for the project at the policy level through a Board of Participating Countries. Experts from participating countries serve on working groups that are charged with linking the PISA policy objectives with the best available substantive and technical expertise in the field of international comparative assessment of educational outcomes. Through participating in these expert groups, countries ensure that the PISA assessment instruments are internationally valid and take into account the cultural and curricular contexts of OECD Member countries, that they provide a realistic basis for measurement, and that they place an emphasis on authenticity and educational validity. The frameworks and assessment instruments for PISA 2000 are the product of a multi-year development process and were adopted by OECD Member countries in December 1999.
First results from the PISA 2000 assessment were published in Knowledge and Skills for Life - First Results from PISA 2000 (2001). This publication presents evidence on the performance in reading, mathematical and scientific literacy of students, schools and countries, provides insights into the factors that influence the development of these skills at home and at school, and examines how these factors interact and what the implications are for policy development.
PISA is methodologically highly complex, requiring intensive collaboration among many stakeholders. The successful implementation of PISA depends on the use, and sometimes further development, of state-of-the-art methodologies. The PISA Technical Report describes those methodologies, along with other features that have enabled PISA to provide high quality data to support policy formation and review. The descriptions are provided at a level of detail that will enable review and potentially replication of the implemented procedures and technical solutions to problems.
The PISA Technical Report is the product of a collaborative effort between the countries participating in PISA and the experts and institutions working within the framework of the International PISA Consortium, which the OECD contracted to implement PISA. The report was prepared by the International PISA Consortium, under the direction of Raymond Adams and Margaret Wu. Authors and contributors to the individual chapters of this report are identified in the corresponding chapters.
TABLE OF CONTENTS
CHAPTER 1. THE PROGRAMME FOR INTERNATIONAL STUDENT ASSESSMENT: AN OVERVIEW .......... 15
Managing and implementing PISA .......... 17
This report .......... 17
READER’S GUIDE .......... 19
SECTION ONE: INSTRUMENT DESIGN
CHAPTER 2. TEST DESIGN AND TEST DEVELOPMENT .......... 21
Test scope and test format .......... 21
Timeline .......... 21
Assessment framework and test design .......... 22
Development of the assessment frameworks .......... 22
Test design .......... 22
Test development team .......... 23
Test development process .......... 23
Item submissions from participating countries .......... 23
Item writing .......... 23
The item review process .......... 24
Pre-pilots and translation input .......... 24
Preparing marking guides and marker training material .......... 25
PISA field trial and item selection for the main study .......... 25
Item bank software .......... 26
PISA 2000 main study .......... 26
PISA 2000 test characteristics .......... 27
CHAPTER 3. STUDENT AND SCHOOL QUESTIONNAIRE DEVELOPMENT .......... 33
Overview .......... 33
Development process .......... 34
The questions .......... 34
The framework .......... 34
Coverage .......... 36
Student questionnaire .......... 36
School questionnaire .......... 36
Cross-national appropriateness of items .......... 37
Cross-curricular competencies and information technology .......... 37
The cross-curricular competencies instrument .......... 37
The information technology or computer familiarity instrument .......... 38
SECTION TWO: OPERATIONS
CHAPTER 4. SAMPLE DESIGN .......... 39
Target population and overview of the sampling design .......... 39
Population coverage, and school and student participation rate standards .......... 40
Coverage of the international PISA target population .......... 40
Accuracy and precision .......... 41
School response rates .......... 41
Student response rates .......... 42
Main study school sample .......... 42
Definition of the national target population .......... 42
The sampling frame .......... 43
Stratification .......... 43
Assigning a measure of size to each school .......... 48
School sample selection .......... 48
Student samples .......... 53
CHAPTER 5. TRANSLATION AND CULTURAL APPROPRIATENESS OF THE TEST AND SURVEY MATERIAL .......... 57
Introduction .......... 57
Double translation from two source languages .......... 57
Development of source versions .......... 59
PISA translation guidelines .......... 60
Translation training session .......... 61
International verification of the national versions .......... 61
Translation and verification procedure outcomes .......... 63
Linguistic characteristics of English and French source versions of the stimuli .......... 64
Psychometric quality of the French and English versions .......... 66
Psychometric quality of the national versions .......... 67
Discussion .......... 69
CHAPTER 6. FIELD OPERATIONS .......... 71
Overview of roles and responsibilities .......... 71
National Project Managers .......... 71
School Co-ordinators .......... 72
Test administrators .......... 72
Documentation .......... 72
Materials preparation .......... 73
Assembling test booklets and questionnaires .......... 73
Printing test booklets and questionnaires .......... 73
Packaging and shipping materials .......... 74
Receipt of materials at the National Centre after testing .......... 75
Processing tests and questionnaires after testing .......... 75
Steps in the marking process .......... 75
Logistics prior to marking .......... 77
How marks were shown .......... 78
Marker identification numbers .......... 78
Design for allocating booklets to markers .......... 78
Managing the actual marking .......... 81
Multiple marking .......... 82
Cross-national marking .......... 84
Questionnaire coding .......... 84
Data entry, data checking and file submission .......... 84
Data entry .......... 84
Data checking .......... 84
Data submission .......... 84
After data were submitted .......... 84
CHAPTER 7. QUALITY MONITORING .......... 85
Preparation of quality monitoring procedures .......... 85
Implementing quality monitoring procedures .......... 86
Training National Centre Quality Monitors and School Quality Monitors .......... 86
National Centre Quality Monitoring .......... 86
School Quality Monitoring .......... 86
Site visit data .......... 87
SECTION THREE: DATA PROCESSING
CHAPTER 8. SURVEY WEIGHTING AND THE CALCULATION OF SAMPLING VARIANCE .......... 89
Survey weighting .......... 89
The school base weight .......... 90
The school weight trimming factor .......... 91
The student base weight .......... 91
School non-response adjustment .......... 91
Grade non-response adjustment .......... 94
Student non-response adjustment .......... 94
Trimming student weights .......... 95
Subject-specific factors for mathematics and science .......... 96
Calculating sampling variance .......... 96
The balanced repeated replication variance estimator .......... 96
Reflecting weighting adjustments .......... 97
Formation of variance strata .......... 98
Countries where all students were selected for PISA .......... 98
CHAPTER 9. SCALING PISA COGNITIVE DATA .......... 99
The mixed coefficients multinomial logit model .......... 99
The population model .......... 100
Combined model .......... 101
Application to PISA .......... 101
National calibrations .......... 101
Item response model fit (infit mean square) .......... 102
Discrimination coefficients .......... 102
Item-by-country interaction .......... 102
National reports .......... 102
International calibration .......... 105
Student score generation .......... 105
Computing maximum likelihood estimates in PISA .......... 105
Plausible values .......... 105
Constructing conditioning variables .......... 107
Analysis of data with plausible values .......... 107
CHAPTER 10. CODING AND MARKER RELIABILITY STUDIES .......... 109
Examining within-country variability in marking .......... 109
Homogeneity analysis .......... 109
Homogeneity analysis with restrictions .......... 112
An additional criterion for selecting items .......... 114
A comparison between field trial and main study .......... 116
Variance component analysis .......... 116
Inter-country rater reliability study - Design .......... 124
Recruitment of international markers .......... 125
Inter-country rater reliability study training sessions .......... 125
Flag files .......... 126
Adjudication .......... 126
CHAPTER 11. DATA CLEANING PROCEDURES .......... 127
Data cleaning at the National Centre .......... 127
Data cleaning at the International Centre .......... 128
Data cleaning organisation .......... 128
Data cleaning procedures .......... 128
National adaptations to the database .......... 128
Verifying the student tracking form and the list of schools .......... 128
Verifying the reliability data .......... 129
Verifying the context questionnaire data .......... 129
Preparing files for analysis .......... 129
Processed cognitive file .......... 130
The student questionnaire .......... 130
The school questionnaire .......... 130
The weighting files .......... 130
The reliability files .......... 131
SECTION FOUR: QUALITY INDICATORS AND OUTCOMES
CHAPTER 12. SAMPLING OUTCOMES .......... 133
Design effect and effective sample size .......... 141
CHAPTER 13. SCALING OUTCOMES .......... 149
International characteristics of the item pool .......... 149
Test targeting .......... 149
Test reliability .......... 152
Domain inter-correlations .......... 152
Reading sub-scales .......... 153
Scaling outcomes .......... 153
National item deletions .......... 153
International scaling .......... 154
Generating student scale scores .......... 154
Differential item functioning .......... 154
Test length .......... 156
Magnitude of booklet effects .......... 157
Correction of the booklet effect .......... 160
CHAPTER 14. OUTCOMES OF MARKER RELIABILITY STUDIES .......... 163
Within-country reliability studies .......... 163
Variance components analysis .......... 163
Generalisability coefficients .......... 163
Inter-country rater reliability study - Outcome .......... 174
Overview of the main results .......... 174
Main results by item .......... 174
Main study by country .......... 176
CHAPTER 15. DATA ADJUDICATION .......... 179
Overview of response rate issues .......... 179
Follow-up data analysis approach .......... 181
Review of compliance with other standards .......... 182
Detailed country comments .......... 182
Summary .......... 182
Australia .......... 182
Austria .......... 183
Belgium .......... 183
Brazil .......... 184
Canada .......... 184
Czech Republic .......... 184
Denmark ....................................184
Finland ....................................185
France ....................................185
Germany ....................................185
Greece ....................................185
Hungary ....................................185
Iceland ....................................185
Ireland ....................................185
Italy ....................................185
Japan ....................................185
Korea ....................................186
Latvia ....................................186
Liechtenstein ....................................186
Luxembourg ....................................186
Mexico ....................................186
Netherlands ....................................186
New Zealand ....................................187
Norway ....................................189
Poland ....................................189
Portugal ....................................189
Russian Federation ....................................189
Spain ....................................190
Sweden ....................................190
Switzerland ....................................190
United Kingdom ....................................191
United States ....................................191
SECTION FIVE: SCALE CONSTRUCTION AND DATA PRODUCTS
CHAPTER 16. PROFICIENCY SCALES CONSTRUCTION ....................................195
Introduction ....................................195
Development of the described scales ....................................195
Stage 1: Identifying possible sub-scales ....................................195
Stage 2: Assigning items to scales ....................................196
Stage 3: Skills audit ....................................196
Stage 4: Analysing field trial data ....................................196
Stage 5: Defining the dimensions ....................................196
Stage 6: Revising and refining main study data ....................................196
Stage 7: Validating ....................................197
Defining proficiency levels ....................................197
Reading literacy ....................................199
Scope ....................................199
Three reading proficiency scales ....................................199
Task variables ....................................200
Task variables for the combined reading literacy scale ....................................200
Task variables for the retrieving information sub-scale ....................................201
Task variables for the interpreting texts sub-scale ....................................201
Task variables for the reflection and evaluation sub-scale ....................................202
Retrieving information sub-scale ....................................204
Interpreting texts sub-scale ....................................205
Reflection and evaluation sub-scale ....................................206
Combined reading literacy scale ....................................207
Cut-off points for reading ....................................207
Mathematical literacy ....................................208
What is being assessed? ....................................208
Illustrating the PISA mathematical literacy scale ....................................210
Scientific literacy ....................................211
What is being assessed? ....................................211
Illustrating the PISA scientific literacy scale ....................................213
CHAPTER 17. CONSTRUCTING AND VALIDATING THE QUESTIONNAIRE INDICES ....................................217
Validation procedures of indices ....................................218
Indices ....................................219
Student characteristics and family background ....................................219
Parental interest and family relations ....................................221
Family possessions ....................................224
Instruction and learning ....................................226
School and classroom climate ....................................228
Reading habits ....................................230
Motivation and interest ....................................232
Learning strategies ....................................234
Learning style ....................................237
Self-concept ....................................238
Learning confidence ....................................240
Computer familiarity or information technology ....................................242
Basic school characteristics ....................................244
School policies and practices ....................................245
School climate ....................................246
School resources ....................................249
CHAPTER 18. INTERNATIONAL DATABASE ....................................253
Files in the database ....................................253
The student files ....................................253
The school file ....................................254
The assessment items data file ....................................254
Records in the database ....................................254
Records included in the database ....................................254
Records excluded from the database ....................................254
Representing missing data ....................................254
How are students and schools identified? ....................................255
The student files ....................................255
The cognitive files ....................................256
The school file ....................................256
Further information ....................................257
REFERENCES ......................................................................................................................................259
APPENDIX 1. Summary of PISA reading literacy items........................................................................262
APPENDIX 2. Summary of PISA mathematical literacy items ..............................................................266
APPENDIX 3. Summary of PISA scientific literacy items......................................................................267
APPENDIX 4. The validity studies of occupation and socio-economic status (Summary) ....................................268
Canada ....................................268
Czech Republic ....................................268
France ....................................268
United Kingdom ....................................269
APPENDIX 5. The main findings from quality monitoring of the PISA 2000 main study ....................................270
School Quality Monitor Reports ....................................270
Preparation for the assessment ....................................272
Beginning the testing sessions ....................................273
Conducting the testing sessions ....................................273
The student questionnaire session ....................................274
General questions concerning the assessment ....................................274
Summary ....................................275
Interview with the PISA school co-ordinator ....................................275
National Centre Quality Monitor Reports ....................................277
Staffing ....................................277
School Quality Monitors ....................................278
Markers ....................................278
Administrative Issues ....................................278
Translation and Verification ....................................279
Adequacy of the manuals ....................................280
Conclusion ....................................281
APPENDIX 6. Order effects figures ......................................................................................................282
APPENDIX 7. PISA 2000 Expert Group membership and Consortium staff ....................................286
PISA Consortium ....................................286
Australian Council for Educational Research ....................................286
Westat ....................................286
Citogroep ....................................286
Educational Testing Service ....................................286
Other experts ....................................286
Reading Functional Expert Group ....................................287
Mathematics Functional Expert Group ....................................287
Science Functional Expert Group ....................................287
Cultural Review Panel ....................................287
Technical Advisory Group ....................................287
APPENDIX 8. Contrast coding for PISA 2000 conditioning variables..................................................288
APPENDIX 9. Sampling forms .............................................................................................................305
APPENDIX 10. Student listing form.....................................................................................................316
APPENDIX 11. Student tracking form..................................................................................................318
APPENDIX 12. Adjustment to BRR for strata with odd numbers of primary sampling units ..............320
List of Figures
1. PISA 2000 in a Nutshell ....................................16
2. Test Booklet Design for the Main Survey ....................................23
3. Mapping the Coverage of the PISA 2000 Questionnaires ....................................35
4. School Response Rate Standards - Criteria for Acceptability ....................................42
5. Student Exclusion Criteria as Conveyed to National Project Managers ....................................56
6. Allocation of the Booklets for Single Marking of Reading by Cluster ....................................79
7. Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Common Markers ....................................80
8. Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Separate Markers ....................................80
9. Booklet Allocation to Markers for Multiple Marking of Reading ....................................82
10. Booklet Allocation for Multiple Marking of Mathematics and Science, Common Markers ....................................83
11. Booklet Allocation for Multiple Marking of Mathematics ....................................83
12. Example of Item Statistics Shown in Report 1 ....................................103
13. Example of Item Statistics Shown in Report 2 ....................................103
14. Example of Item Statistics Shown in Report 3 ....................................104
15. Example of Item Statistics Shown in Report 4 ....................................104
16. Example of Item Statistics Shown in Report 5 ....................................105
17. Notational System ....................................110
18. Quantification of Categories ....................................111
19. H* and H** for Science Items ....................................114
20. Homogeneity and Departure from Uniformity ....................................116
21. Comparison of Homogeneity Indices for Reading Items in the Main Study and the Field Trial in Australia ....................................117
22. Comparison of Reading Item Difficulty and Achievement Distributions ....................................151
23. Comparison of Mathematics Item Difficulty and Achievement Distributions ....................................151
24. Comparison of Science Item Difficulty and Achievement Distributions ....................................152
25. Items Deleted for Particular Countries ....................................153
26. Comparison of Item Parameter Estimates for Males and Females in Reading ....................................155
27. Comparison of Item Parameter Estimates for Males and Females in Mathematics ....................................155
28. Comparison of Item Parameter Estimates for Males and Females in Science ....................................155
29. Item Parameter Differences for the Items in Cluster One ....................................161
30. Plot of Attained PISA School Response Rates ....................................181
31. What it means to be at a Level ....................................198
32. The Retrieving Information Sub-Scale ....................................204
33. The Interpreting Texts Sub-Scale ....................................205
34. The Reflection and Evaluation Sub-Scale ....................................206
35. The Combined Reading Literacy Scale ....................................207
36. Combined Reading Literacy Scale Level Cut-Off Points ....................................207
37. The Mathematical Literacy Scale ....................................209
38. Typical Mathematical Literacy Tasks ....................................211
39. The Scientific Literacy Scale ....................................212
40. Characteristics of the Science Unit, Semmelweis ....................................214
41. Characteristics of the Science Unit, Ozone ....................................215
42. Item Parameter Estimates for the Index of Cultural Communication (CULTCOM) ....................................221
43. Item Parameter Estimates for the Index of Social Communication (SOCCOM) ....................................222
44. Item Parameter Estimates for the Index of Family Educational Support (FAMEDSUP) ....................................222
45. Item Parameter Estimates for the Index of Cultural Activities (CULTACTV) ....................................222
46. Item Parameter Estimates for the Index of Family Wealth (WEALTH) ....................................224
47. Item Parameter Estimates for the Index of Home Educational Resources (HEDRES) ....................................224
48. Item Parameter Estimates for the Index of Possessions Related to ‘Classical Culture’ in the Family Home (CULTPOSS) ....................................225
49. Item Parameter Estimates for the Index of Time Spent on Homework (HMWKTIME) ....................................226
50. Item Parameter Estimates for the Index of Teacher Support (TEACHSUP) ....................................226
51. Item Parameter Estimates for the Index of Achievement Press (ACHPRESS) ....................................227
52. Item Parameter Estimates for the Index of Disciplinary Climate (DISCLIM) ....................................228
53. Item Parameter Estimates for the Index of Teacher-Student Relations (STUDREL) ....................................229
54. Item Parameter Estimates for the Index of Sense of Belonging (BELONG) ....................................229
55. Item Parameter Estimates for the Index of Engagement in Reading (JOYREAD) ....................................231
56. Item Parameter Estimates for the Index of Reading Diversity (DIVREAD) ....................................231
57. Item Parameter Estimates for the Index of Instrumental Motivation (INSMOT) ....................................232
58. Item Parameter Estimates for the Index of Interest in Reading (INTREA) ....................................233
59. Item Parameter Estimates for the Index of Interest in Mathematics (INTMAT) ....................................233
60. Item Parameter Estimates for the Index of Control Strategies (CSTRAT) ....................................235
61. Item Parameter Estimates for the Index of Memorisation (MEMOR) ....................................235
62. Item Parameter Estimates for the Index of Elaboration (ELAB) ....................................235
63. Item Parameter Estimates for the Index of Effort and Perseverance (EFFPER) ....................................236
64. Item Parameter Estimates for the Index of Co-operative Learning (COPLRN) ....................................237
65. Item Parameter Estimates for the Index of Competitive Learning (COMLRN) ....................................237
66. Item Parameter Estimates for the Index of Self-Concept in Reading (SCVERB) ....................................239
67. Item Parameter Estimates for the Index of Self-Concept in Mathematics (MATCON) ....................................239
68. Item Parameter Estimates for the Index of Academic Self-Concept (SCACAD) ....................................239
69. Item Parameter Estimates for the Index of Perceived Self-Efficacy (SELFEF) ....................................241
70. Item Parameter Estimates for the Index of Control Expectation (CEXP) ....................................241
71. Item Parameter Estimates for the Index of Comfort With and Perceived Ability to Use Computers (COMAB) ....................................242
72. Item Parameter Estimates for the Index of Computer Usage (COMUSE) ....................................243
73. Item Parameter Estimates for the Index of Interest in Computers (COMATT) ....................................243
74. Item Parameter Estimates for the Index of School Autonomy (SCHAUTON) ....................................245
75. Item Parameter Estimates for the Index of Teacher Autonomy (TCHPARTI) ....................................245
76. Item Parameter Estimates for the Index of Principals’ Perceptions of Teacher-Related Factors Affecting School Climate (TEACBEHA) ....................................247
77. Item Parameter Estimates for the Index of Principals’ Perceptions of Student-Related Factors Affecting School Climate (STUDBEHA) ....................................247
78. Item Parameter Estimates for the Index of Principals’ Perceptions of Teachers' Morale and Commitment (TCMORALE) ....................................247
79. Item Parameter Estimates for the Index of the Quality of the Schools’ Educational Resources (SCMATEDU) ....................................249
80. Item Parameter Estimates for the Index of the Quality of the Schools’ Physical Infrastructure (SCMATBUI) ....................................249
81. Item Parameter Estimates for the Index of Teacher Shortage (TCSHORT) ....................................250
82. Duration of Questionnaire Session, Showing Histogram and Associated Summary Statistics ....................................271
83. Duration of Total Student Working Time, with Histogram and Associated Summary Statistics ....................................272
84. Item Parameter Differences for the Items in Cluster One ....................................282
85. Item Parameter Differences for the Items in Cluster Two ....................................283
86. Item Parameter Differences for the Items in Cluster Three ....................................283
87. Item Parameter Differences for the Items in Cluster Four ....................................283
88. Item Parameter Differences for the Items in Cluster Five ....................................284
89. Item Parameter Differences for the Items in Cluster Six ....................................284
90. Item Parameter Differences for the Items in Cluster Seven ....................................284
91. Item Parameter Differences for the Items in Cluster Eight ....................................285
92. Item Parameter Differences for the Items in Cluster Nine ....................................285
List of Tables
1. Test Development Timeline
2. Distribution of Reading Items by Text Structure and by Item Type
3. Distribution of Reading Items by Type of Task (Process) and by Item Type
4. Distribution of Reading Items by Text Type and by Item Type
5. Distribution of Reading Items by Context and by Item Type
6. Items from IALS
7. Distribution of Mathematics Items by Overarching Concepts and by Item Type
8. Distribution of Mathematics Items by Competency Class and by Item Type
9. Distribution of Mathematics Items by Mathematical Content Strands and by Item Type
10. Distribution of Mathematics Items by Context and by Item Type
11. Distribution of Science Items by Science Processes and by Item Type
12. Distribution of Science Items by Science Major Area and by Item Type
13. Distribution of Science Items by Science Content Application and by Item Type
14. Distribution of Science Items by Context and by Item Type
15. Stratification Variables
16. Schedule of School Sampling Activities
17. Percentage of Correct Answers in English and French-Speaking Countries or Communities for Groups of Test Units with Small or Large Differences in Length of Stimuli in the Source Languages
18. Correlation of Linguistic Difficulty Indicators in 26 English and French Texts
19. Percentage of Flawed Items in the English and French National Versions
20. Percentage of Flawed Items by Translation Method
21. Differences Between Translation Methods
22. Non-Response Classes
23. School and Student Trimming
24. Some Examples of ∆
25. Contribution of Model Terms to (Co)Variance
26. Coefficients of the Three Variance Components (General Case)
27. Coefficients of the Three Variance Components (Complete Case)
28. Estimates of Variance Components (×100) in the United States
29. Variance Components (%) for Mathematics in the Netherlands
30. Estimates of ρ3 for Mathematics Scale in the Netherlands
31. Sampling and Coverage Rates
32. School Response Rates Before Replacements
33. School Response Rates After Replacements
34. Student Response Rates After Replacements
35. Design Effects and Effective Sample Sizes for the Mean Performance on the Combined Reading Literacy Scale
36. Design Effects and Effective Sample Sizes for the Mean Performance on the Mathematical Literacy Scale
37. Design Effects and Effective Sample Sizes for the Mean Performance on the Scientific Literacy Scale
38. Design Effects and Effective Sample Sizes for the Percentage of Students at Level 3 on the Combined Reading Literacy Scale
39. Design Effects and Effective Sample Sizes for the Percentage of Students at Level 5 on the Combined Reading Literacy Scale
40. Number of Sampled Students by Country and Booklet
41. Reliability of the Three Domains Based Upon Unconditioned Unidimensional Scaling
42. Latent Correlation Between the Three Domains
43. Correlation Between Scales
44. Final Reliability of the PISA Scales
45. Average Number of Not-Reached Items and Missing Items by Booklet and Testing Session
46. Average Number of Not-Reached Items and Missing Items by Country and Testing Session
47. Distribution of Not-Reached Items by Booklet, First Testing Session
48. Distribution of Not-Reached Items by Booklet, Second Testing Session
49. Mathematics Means for Each Booklet and Country
50. Reading Means for Each Booklet and Country
51. Science Means for Each Booklet and Country
52. Booklet Difficulty Parameters
53. Booklet Difficulty Parameters Reported on the PISA Scale
54. Variance Components for Mathematics
55. Variance Components for Science
56. Variance Components for Retrieving Information
57. Variance Components for Interpreting Texts
58. Variance Components for Reflection and Evaluation
59. ρ3-Estimates for Mathematics
60. ρ3-Estimates for Science
61. ρ3-Estimates for Retrieving Information
62. ρ3-Estimates for Interpreting Texts
63. ρ3-Estimates for Reflection and Evaluation
64. Summary of Inter-Country Rater Reliability Study
65. Summary of Item Characteristics for Inter-Country Rater Reliability Study
66. Items with Less than 90 Per Cent Consistency
67. Inter-Country Reliability Study Items with More than 25 Per Cent Inconsistency Rate in at Least One Country
68. Inter-Country Summary by Country
69. The Potential Impact of Non-response on PISA Proficiency Estimates
70. Performance in Reading by Grade in the Flemish Community of Belgium
71. Average of the School Percentage of Over-Aged Students
72. Results for the New Zealand Logistic Regression
73. Results for the United States Logistic Regression
74. Reliability of Parental Interest and Family Relations Scales
75. Reliability of Family Possessions Scales
76. Reliability of Instruction and Learning Scales
77. Reliability of School and Classroom Climate Scales
78. Reliability of Reading Habits Scales
79. Reliability of Motivation and Interest Scales
80. Reliability of Learning Strategies Scales
81. Reliability of Learning Style Scales
82. Reliability of Self-Concept Scales
83. Reliability of Learning Confidence Scales
84. Reliability of Computer Familiarity or Information Technology Scales
85. Reliability of School Policies and Practices Scales
86. Reliability of School Climate Scales
87. Reliability of School Resources Scales
88. Validity Study Results for France
89. Validity Study Results for the United Kingdom
90. Responses to the School Co-ordinator Interview
91. Number of Students per School Refusing to Participate in PISA Prior to Testing
92. Item Assignment to Reading Clusters
Chapter 1
The Programme for International Student Assessment: An Overview
Ray Adams
The OECD Programme for International Student Assessment (PISA) is a collaborative effort among OECD Member countries to measure how well 15-year-old young adults approaching the end of compulsory schooling are prepared to meet the challenges of today's knowledge societies.1 The assessment is forward-looking: rather than focusing on the extent to which these students have mastered a specific school curriculum, it looks at their ability to use their knowledge and skills to meet real-life challenges. This orientation reflects a change in curricular goals and objectives, which are increasingly concerned with what students can do with what they learn at school.
The first PISA survey was conducted in 2000 in 32 countries (including 28 OECD Member countries) using written tasks answered in schools under independently supervised test conditions. Another 11 countries will complete the same assessment in 2002. PISA 2000 surveyed reading, mathematical and scientific literacy, with a primary focus on reading. Measures of attitudes to learning, and information on how students manage their own learning, were also obtained in 25 countries as part of an international option. The survey will be repeated every three years, with the primary focus shifting to mathematics in 2003, science in 2006 and back to reading in 2009.
In addition to the assessments, PISA 2000 included Student and School Questionnaires to collect data that could be used in constructing indicators pointing to social, cultural, economic and educational factors that are associated with student performance. Using the data taken from these two questionnaires, analyses linking context information with student achievement could address differences:
• between countries in the relationships between student-level factors (such as gender and social background) and achievement;
• in the relationships between school-level factors and achievement across countries;
• in the proportion of variation in achievement between (rather than within) schools, and differences in this value across countries;
• between countries in the extent to which schools moderate or increase the effects of individual-level student factors on student achievement;
• in education systems and national contexts that are related to differences in student achievement across countries; and
• in the future, changes in any or all of these relationships over time.
Through the collection of such information at the student and school level on a cross-nationally comparable basis, PISA adds significantly to the knowledge base that was previously available from national official statistics, such as aggregate national statistics on the educational programmes completed and the qualifications obtained by individuals.
The ambitious goals of PISA come at a cost: PISA is both resource intensive and methodologically highly complex, requiring intensive collaboration among many stakeholders. The successful implementation of PISA depends on the use, and sometimes further development, of state-of-the-art methodologies. This report describes some of those methodologies, along with other features that have enabled PISA to provide high quality data to support policy formation and review.

1 In most OECD countries, compulsory schooling ends at age 15 or 16; in the United States it ends at age 17, and in Belgium, Germany and the Netherlands, at age 18 (OECD, 2001).
Figure 1 provides an overview of the central design elements of PISA. The remainder of this report describes these design elements and the associated procedures in more detail.
Figure 1: PISA 2000 in a Nutshell

Sample size
• More than a quarter of a million students, representing almost 17 million 15-year-olds enrolled in the schools of the 32 participating countries, were assessed in 2000. Another 11 countries will administer the same assessment in 2002.

Content
• PISA 2000 covered three domains: reading literacy, mathematical literacy and scientific literacy.
• PISA 2000 looked at young people's ability to use their knowledge and skills in order to meet real-life challenges rather than how well they had mastered a specific school curriculum.
• The emphasis was placed on the mastery of processes, the understanding of concepts, and the ability to function in various situations within each domain.
• As part of an international option taken up in 25 countries, PISA 2000 collected information on students' attitudes to learning.

Methods
• PISA 2000 used pencil-and-paper assessments, lasting two hours for each student.
• PISA 2000 used both multiple-choice items and questions requiring students to construct their own answers. Items were typically organised in units based on a passage describing a real-life situation.
• A total of seven hours of assessment items was included, with different students taking different combinations of the assessment items.
• Students answered a background questionnaire that took about 30 minutes to complete and, as part of an international option, completed questionnaires on learning and study practices as well as familiarity with computers.
• School principals completed a questionnaire about their school.

Outcomes
• A profile of knowledge and skills among 15-year-olds.
• Contextual indicators relating results to student and school characteristics.
• A knowledge base for policy analysis and research.
• Trend indicators showing how results change over time, once data become available from subsequent cycles of PISA.

Future assessments
• PISA will continue in three-year cycles. In 2003, the focus will be on mathematics and in 2006 on science. The assessment of cross-curricular competencies is being progressively integrated into PISA, beginning with an assessment of problem-solving skills in 2003.
Managing and Implementing PISA
The design and implementation of PISA 2000 was the responsibility of an international Consortium led by the Australian Council for Educational Research (ACER). The other partners in this Consortium have been the National Institute for Educational Measurement (Cito Group) in the Netherlands, the Service de Pédagogie Expérimentale at Université de Liège in Belgium, Westat and the Educational Testing Service (ETS) in the United States, and the National Institute for Educational Research (NIER) in Japan. Appendix 7 lists the many Consortium staff and consultants who have made important contributions to the development and implementation of the project.
The Consortium implements PISA within a framework established by a Board of Participating Countries (BPC), which includes representation from all countries at senior policy levels. The BPC established policy priorities and standards for developing indicators, for establishing assessment instruments, and for reporting results. Experts from participating countries served on working groups linking the programme policy objectives with the best internationally available technical expertise in the three assessment areas. These expert groups were referred to as Functional Expert Groups (FEGs) (see Appendix 7 for members). By participating in these expert groups and regularly reviewing outcomes of the groups' meetings, countries ensured that the instruments were internationally valid, that they took into account the cultural and educational contexts of the different OECD Member countries, that the assessment materials had strong measurement potential, and that the instruments emphasised authenticity and educational validity.
Participating countries implemented PISA nationally through National Project Managers (NPMs), who respected common technical and administrative procedures. These managers played a vital role in developing and validating the international assessment instruments and ensured that PISA implementation was of high quality. The NPMs also contributed to the verification and evaluation of the survey results, analyses and reports.
The OECD Secretariat had overall responsibility for managing the programme. It monitored its implementation on a day-to-day basis, served as the secretariat for the BPC, fostered consensus building between the countries involved, and served as the interlocutor between the BPC and the international Consortium.
This Report
This Technical Report does not report the results of PISA. The first results from PISA were published in December 2001 in Knowledge and Skills for Life (OECD, 2001), and a sequence of thematic reports covering topics such as Social Background and Student Achievement; The Distribution and Impact of Strategies of Self-Regulated Learning and Self-Concept; Engagement and Motivation; and School Factors Related to Quality and Equity is planned for publication in the coming months.
This Technical Report is designed to describe the technical aspects of the project at a sufficient level of detail to enable review and potentially replication of the implemented procedures and technical solutions to problems. The report is broken into five sections:
• Section One—Instrument Design: Covers the design and development of both the questionnaires and achievement tests.
• Section Two—Operations: Covers the operational procedures for the sampling and population definitions, test administration procedures, quality monitoring and assurance procedures for test administration and national centre operations, and instrument translation.
• Section Three—Data Processing: Covers the methods used in data cleaning and preparation, including the methods for weighting and variance estimation, scaling methods, methods for examining inter-rater variation and the data cleaning steps.
• Section Four—Quality Indicators and Outcomes: Covers the results of the scaling and weighting, reports response rates and related sampling outcomes, and gives the outcomes of the inter-rater reliability studies. The last chapter in this section summarises the outcomes of the PISA 2000 data adjudication—that is, the overall analysis of data quality for each country.
• Section Five—Scale Construction and Data Products: Describes the construction of the PISA 2000 described levels of proficiency and the construction and validation of questionnaire-related indices. The final chapter briefly describes the contents of the PISA 2000 database.
• Appendices: Detailed appendices of results pertaining to the chapters of the report are provided.
Reader’s Guide
List of abbreviations
The following abbreviations are used in this report:
ACER	Australian Council for Educational Research
AGFI	Adjusted Goodness-of-Fit Index
BPC	PISA Board of Participating Countries
BRR	Balanced Repeated Replication
CCC	PISA 2000 questionnaire on self-regulated learning
CFA	Confirmatory Factor Analysis
CFI	Comparative Fit Index
CIVED	Civic Education Study
DIF	Differential Item Functioning
ENR	Enrolment of 15-year-olds
ETS	Educational Testing Service
FEG	Functional Expert Group
I	Sampling Interval
IALS	International Adult Literacy Survey
ICR	Inter-Country Rater Reliability Study
IEA	International Association for the Evaluation of Educational Achievement
ISCED	International Standard Classification of Education
ISCO	International Standard Classification of Occupations
ISEI	International Socio-Economic Index
IT	PISA questionnaire on computer familiarity
MENR	Enrolment for moderately small schools
MOS	Measure of size
NCQM	National Centre Quality Monitor
NDP	National Desired Population
NEP	National Enrolled Population
NFI	Normed Fit Index
NNFI	Non-Normed Fit Index
NPM	National Project Manager
PISA	Programme for International Student Assessment
PPS	Probability Proportional to Size
PSU	Primary Sampling Units
RMSEA	Root Mean Square Error of Approximation
RN	Random Number
SC	School Co-ordinator
SD	Standard Deviation
SEM	Structural Equation Modelling
SES	Socio-Economic Status
SQM	School Quality Monitor
TA	Test Administrator
TAG	Technical Advisory Group
TCS	Target Cluster Size
TIMSS	Third International Mathematics and Science Study
TIMSS-R	Third International Mathematics and Science Study - Repeat
TOEFL	Test of English as a Foreign Language
VENR	Enrolment for very small schools
WLE	Weighted Likelihood Estimates
Further Information
For further information on the PISA assessment instruments, the PISA database and the methods used, see also:
Knowledge and Skills for Life – First Results from PISA 2000 (OECD, 2001);
Knowledge and Skills for Life – A New Framework for Assessment (OECD, 1999a);
Manual for the PISA 2000 Database (OECD, 2002a);
Sample Tasks from the PISA 2000 Assessment – Reading, Mathematical and Scientific Literacy (OECD, 2002b); and
The PISA Web site (www.pisa.oecd.org).
Test Design and Test Development
Margaret Wu
Test Scope and Test Format
PISA 2000 had three subject domains, with reading as the major domain, and mathematics and science as minor domains. Student achievement in reading was assessed using 141 items representing approximately 270 minutes of testing time. The mathematics assessment consisted of 32 items, and the science assessment consisted of 35 items, representing approximately 60 minutes of testing time for each.
The materials used in the main study were selected from a larger pool of approximately 600 items tested in a field trial conducted in all countries one year prior to the main study. The pool included 16 items from the International Adult Literacy Survey (IALS, OECD, 2000) and three items from the Third International Mathematics and Science Study (TIMSS, Beaton et al., 1996). The IALS items were included to allow for possible linking of PISA results to IALS.1
PISA 2000 was a paper-and-pencil test, with each student undertaking two hours of testing (i.e., answering one of the nine booklets). The test items were multiple-choice, short answer and extended response. Multiple-choice items were either standard multiple-choice, with a limited number (usually four or five) of responses from which students were required to select the best answer, or complex multiple-choice, presenting several statements for each of which students were required to give one of several possible responses (true/false, correct/incorrect, etc.). Closed constructed-response items generally required students to construct a response within very limited constraints, such as mathematics items requiring a numeric answer, or items requiring a word or short phrase, etc. Short response items were similar to closed constructed-response items, but had a wide range of possible responses. Open constructed-response items required more extensive writing, or showing a calculation, and frequently included some explanation or justification.

1 Note that after consideration by the Functional Expert Groups (FEGs), National Project Managers (NPMs) and the Board of Participating Countries (BPC), a link with TIMSS was not pursued in the main study.
Pencils, erasers, rulers and, in some cases, calculators were provided. The Consortium recommended that calculators be provided in countries where they were routinely used in the classroom. National centres decided whether calculators should be provided for their students on the basis of standard national practice. No items in the pool required a calculator, but some items involved solution steps for which the use of a calculator could facilitate computation. In developing the mathematics items, test developers were particularly mindful to ensure that the items were as calculator-neutral as possible.
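The five item formats described above form a simple taxonomy that later chapters rely on when discussing marking. A hypothetical encoding (this sketch is illustrative only; the names are not drawn from the PISA materials) makes the split between machine-scorable and marker-scored formats explicit:

```python
# Illustrative sketch: the five PISA 2000 item formats and the response
# constraint each implies. Names and descriptions paraphrase the text above.
from enum import Enum


class ItemFormat(Enum):
    MULTIPLE_CHOICE = "select the best of four or five options"
    COMPLEX_MULTIPLE_CHOICE = "true/false-style response to each of several statements"
    CLOSED_CONSTRUCTED_RESPONSE = "tightly constrained answer, e.g. a number or short phrase"
    SHORT_RESPONSE = "brief answer with a wide range of acceptable responses"
    OPEN_CONSTRUCTED_RESPONSE = "extended writing, calculation or justification"


# Only the two multiple-choice formats can be captured directly from answer
# options; the constructed-response formats require trained markers.
SELECTED_RESPONSE = {ItemFormat.MULTIPLE_CHOICE, ItemFormat.COMPLEX_MULTIPLE_CHOICE}

assert len(ItemFormat) == 5
assert ItemFormat.OPEN_CONSTRUCTED_RESPONSE not in SELECTED_RESPONSE
```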
Timeline
The project started in January 1998, when initial conceptions of the frameworks were discussed. The formal process of test development began after the first Functional Expert Groups' (FEGs) meetings in March 1998. The main phase of test item development finished when the items were distributed for the field trial in November 1998. During this eight-month period, intensive work was carried out in writing and reviewing items, and on pre-pilot activities.
PIS
A 2
00
0 t
ech
nic
al r
epor
t
22
The field trial for most countries took place between February and July 1999, after which items were selected for the main study and distributed to countries in December 1999.
Table 1 shows the PISA 2000 test development timeline.
Assessment Framework and Test Design
Development of the Assessment Frameworks
The Consortium, through the test developers and expert groups, and in consultation with national centres, developed assessment frameworks for reading, mathematics and science. The frameworks were endorsed by the Board of Participating Countries (BPC) and published by the OECD in 1999 in Measuring Student Knowledge and Skills: A New Framework for Assessment (OECD, 1999a).
The frameworks presented the direction being taken by the PISA assessments. They defined each assessment domain, described the scope of the assessment, the number of items required to assess each component of a domain and the preferred balance of question types, and sketched the possibilities for reporting results.
Test Design
The 141 main study reading items were organised into nine separate clusters, each with an estimated administration time of 30 minutes. The 32 mathematics items and the 35 science items were organised into four 15-minute mathematics clusters and four 15-minute science clusters respectively. These clusters were then combined in various groupings to produce nine linked two-hour test booklets.

Using R1 to R9 to denote the reading clusters, M1 to M4 to denote the mathematics clusters and S1 to S4 to denote the science clusters, the allocation of clusters to booklets is illustrated in Figure 2.

The BPC requested that the majority of booklets begin with reading items since reading was the major area. In the design, seven of the nine booklets begin with reading. The BPC further requested that students not be expected to switch between assessment areas, and so none of the booklets requires students to return to any area after having left it.

Reading items occur in all nine booklets, and there are linkages between the reading in all booklets. This permits all sampled students to be assigned reading scores on common scales. Mathematics items occur in five of the nine booklets, and there are links between the five booklets, allowing mathematics scores to be reported on a common scale for five-ninths of the sampled students. Similarly, science material occurs in five linked booklets, allowing science scores to be reported on a common scale for five-ninths of the sampled students.

In addition to the nine two-hour booklets, a special one-hour booklet, referred to as Booklet zero (or the SE booklet), was prepared for use in schools catering exclusively to students with special needs. The SE booklet was shorter, and designed to be somewhat easier, than the other nine booklets.
Table 1: Test Development Timeline

Activity                                    Date
Develop frameworks                          January-September 1998
Item submission from countries              May-July 1998
Develop items                               March-November 1998
Pre-pilot in Australia                      October and November 1998
Distribution of field trial material        November 1998
Translation into national languages         November 1998-January 1999
Pre-pilot in the Netherlands                February 1999
Field trial marker training                 February 1999
Field trial in participating countries      February-July 1999
Select items for main study                 July-November 1999
Cultural Review Panel meeting               October 1999
Distribute main study material              December 1999
Main study marker training                  February 2000
Main study in participating countries       February-October 2000

The two-hour test booklets, sometimes referred to in this report as 'cognitive booklets' to distinguish them from the questionnaires, were arranged in two one-hour parts, each made up of two of the 30-minute time blocks from the columns of Figure 2. PISA's procedures provided for a short break to be taken between administration of the two parts of the test booklet, and a longer break to be taken between administration of the test and the questionnaire.
Test Development Team
The core of the test development team consisted of staff from the test development sections of the Australian Council for Educational Research (ACER) in Melbourne and the Cito Group in the Netherlands. Many others, including members of the FEGs, translators and item reviewers at national centres, made substantial contributions to the process of developing the tests.
Test Development Process
Following the development of the assessment framework, the process of test development included: calling for submissions of test items from participating countries; writing and reviewing items; pre-piloting the test material; preparing marking guides and marker training material; and selecting items for the main study.
Item Submissions from Participating Countries
An international comparative study should draw items from a wide range of cultures and languages. Thus, at the start of the PISA 2000 test development process, the Consortium called for participating countries to submit items.

Figure 2: Test Booklet Design for the Main Survey

Booklet   Block 1   Block 2   Block 3   Block 4
1         R1        R2        R4        M1 M2
2         R2        R3        R5        S1 S2
3         R3        R4        R6        M3 M4
4         R4        R5        R7        S3 S4
5         R5        R6        R1        M2 M3
6         R6        R7        R2        S2 S3
7         R7        R1        R3        R8
8         M4 M2     S1 S3     R8        R9
9         S4 S2     M1 M3     R9        R8
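The structural properties of the Figure 2 design can be checked mechanically. The sketch below (in Python, not part of the PISA toolchain) encodes the booklet-to-cluster allocation, assigns each cluster its duration as stated in the text (30-minute reading clusters, 15-minute mathematics and science clusters), and verifies that every booklet holds exactly two hours of material, that reading appears in all nine booklets, and that mathematics and science each appear in five:

```python
# Sketch: verify the Figure 2 booklet design. Cluster durations come from the
# text of this chapter; the allocation below transcribes Figure 2.
CLUSTER_MINUTES = {"R": 30, "M": 15, "S": 15}

BOOKLETS = {
    1: ["R1", "R2", "R4", "M1", "M2"],
    2: ["R2", "R3", "R5", "S1", "S2"],
    3: ["R3", "R4", "R6", "M3", "M4"],
    4: ["R4", "R5", "R7", "S3", "S4"],
    5: ["R5", "R6", "R1", "M2", "M3"],
    6: ["R6", "R7", "R2", "S2", "S3"],
    7: ["R7", "R1", "R3", "R8"],
    8: ["M4", "M2", "S1", "S3", "R8", "R9"],
    9: ["S4", "S2", "M1", "M3", "R9", "R8"],
}


def minutes(clusters):
    """Total testing time for a booklet, from its clusters' durations."""
    return sum(CLUSTER_MINUTES[c[0]] for c in clusters)


def booklets_with(domain):
    """Sorted booklet numbers containing at least one cluster of a domain."""
    return sorted(k for k, b in BOOKLETS.items() if any(c[0] == domain for c in b))


# Every booklet contains exactly two hours of material.
assert all(minutes(b) == 120 for b in BOOKLETS.values())

# Reading occurs in all nine booklets; mathematics and science in five each,
# matching the five-ninths coverage described in the text.
assert booklets_with("R") == [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert booklets_with("M") == [1, 3, 5, 8, 9]
assert booklets_with("S") == [2, 4, 6, 8, 9]
```

The same encoding also confirms the BPC's request that seven of the nine booklets begin with reading: only booklets 8 and 9 open with a mathematics or science cluster.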
A document outlining submission guidelines was prepared stating the purpose of the project and the working definition of each subject domain as drafted by the FEGs. In particular, the document described PISA's literacy orientation and its aim of tapping students' preparedness for life. The Consortium requested authentic materials in their original forms, preferably from the news media and original published texts. The submission guidelines included a list of variables—such as text types and formats, response formats, and contexts (and situations) of the materials—according to which countries were requested to classify their submissions. A statement on copyright status was also sought. Countries submitted items either directly to the Consortium, or through the FEG members, who advised on their appropriateness before sending them on. Having FEG members act as liaisons served the purpose of addressing translation issues, as many FEG members spoke languages other than English and could deal directly with items written in those languages.
A total of 19 countries contributed materialsfor reading, 12 countries contributed materialsfor mathematics, and 13 countries contributedmaterials for science. For a full list of countryand language source for the main study items,see Appendix 1 (for reading), Appendix 2 (formathematics) and Appendix 3 (for science).
Item Writing
The Consortium test development team reviewed items contributed by countries to ensure that they were consistent with the frameworks and had no obvious technical flaws. The test developers also wrote many items from scratch to ensure that the pool satisfied the framework specifications for the composition of the assessment instruments. There were three separate teams of test developers for the three domains, but some integrated units were developed that included both reading and science items.2 In all, 660 minutes of reading material, 180 minutes of mathematics material, 180 minutes of science material, and 60 minutes of integrated material were developed for the field trial.3

2 An integrated unit contained a stimulus text that usually had scientific content. A number of reading items and science items followed the stimulus.
To ensure that translation did not change the intentions of the questions or the difficulty of the items, the test developers provided translation notes where appropriate, to underline potential difficulties with translations and adaptations and to avoid possible misinterpretations of the English or French words.4 Test developers also provided a summary of the intentions of the items. Suggestions for adaptations were also made to indicate the scope of possible changes in the translated texts and items. During the translation process, the test developers adjudicated lists of country adaptations to ensure that the changes did not alter what the items were testing.
The Item Review Process
The field trial was preceded by an extensive review process of the items, including panelling by the test development team, national reviews in each country, reviews by the FEGs, pre-pilots in Australian and Dutch schools, and close scrutiny by the team of translators.

Each institution (ACER and the Cito Group) routinely subjected all item development to an internal item panelling process5 that involved staff from the PISA test development team and test developers not specifically involved in the PISA project. ACER and the Cito Group also exchanged items and reviewed and provided comments on them.

3 However, for the main study, no integrated units were selected.
4 The PISA instruments were distributed in matching English and French source versions.
5 In the test development literature, item panelling is sometimes referred to as item shredding or a cognitive walk-through.

After item panelling, the items were sent in four batches6 to participating countries for national review. The Consortium requested ratings from national subject matter experts of each item and stimulus according to: (i) students' exposure to the content of the item; (ii) item difficulty; (iii) cultural concerns; (iv) other bias concerns; (v) translation problems; and (vi) an overall priority rating for inclusion of the item. In all, 27 countries provided feedback; the ratings were collated by item and used by the test development team to revise and select items for the field trial.

Before the field trial, the reading FEG met on four occasions (March, May, August and October 1998), and the mathematics and science groups met on three occasions (March, May and October 1998) to shape and draft the framework documents and review items in detail. In addition to bringing their expertise in a particular discipline, members of the expert groups came from a wide range of language groups and countries, cultures and education systems. For a list of members of the expert groups, see Appendix 7.
Pre-pilots and Translation Input
In the small-scale pre-pilot conducted in Australian and Dutch schools, about 35 student responses on average were obtained for each item and used to check the clarity of the stimulus and the items, and to construct sample answers for the Marking Guides (see next section). The pre-pilot also provided some indication of the necessary answer time. As a result of the pilots, it was estimated that, on average, an item required about two minutes to complete.
Translation also provided another indirect, albeit effective, review. Translations were made from English to French and vice versa to provide the national translation teams with two source versions of all materials (see Chapter 5), and the translation team often pointed out useful information such as typographical errors, ambiguities, translation difficulties and some cultural issues. The group of experts reviewing the translated material detected some remaining errors in both versions. Finally, one French-speaking member and one English-speaking member of the reading FEG reviewed the items for linguistic accuracy.

6 Referred to as item bundles.
Preparing Marking Guides and Marker Training Material
A Marking Guide was prepared for each subject area with items that required markers to code responses. The guide emphasised that markers were to code rather than score responses. That is, the guides separated different kinds of possible responses, which did not all necessarily receive different scores. The actual scoring was done after the field trial data were analysed, which provided information on the appropriate scores for each different response category.7
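The code-then-score sequence described above can be sketched in a few lines (the item identifiers, codes and score values here are invented for illustration; the actual score assignments came out of the field trial analysis):

```python
# Markers record response CODES; SCORES are assigned only after analysis
# shows which response categories deserve credit. The mapping below is
# purely illustrative -- item IDs, codes and score values are hypothetical.
SCORE_MAP = {
    "R100Q01": {0: 0, 1: 1, 2: 1, 9: 0},   # codes 1 and 2 both score 1
    "R100Q02": {0: 0, 1: 1, 2: 2, 9: 0},   # code 2 earns full credit
}

def score_response(item_id, code):
    """Translate a marker's response code into a score for analysis."""
    return SCORE_MAP[item_id][code]
```

Keeping codes and scores separate means the scoring rule can be revised after the trial without re-marking any booklets.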
The Marking Guide was a list of response codes with descriptions and examples, but a separate training workshop document was also produced for each subject area. Marker training for the field trial took place in February 1999. Markers used the workshop material as exercises to practise using the Marking Guides. The workshop material contained additional student responses; the correct response codes were in hidden text, so that the material could be used both as a test for markers and as reference material. In addition, a Marker Recruitment Kit was produced with more student responses to be used as a screening test when recruiting markers.
PISA Field Trial and Item Selection for the Main Study
The PISA Field Trial was carried out in most countries in the first half of 1999. An average of about 150 to 200 student responses to each item was collected in each country. During the field trial, the Consortium set up a marker query service. Countries were encouraged to send queries to the service so that a common adjudication process was consistently applied to all markers' questions.

Between July and November 1999, national reviewers, the test development team, the FEGs and a Cultural Review Panel reviewed and selected items.
• Each country received an item review form and the country's field trial report (see Chapter 9). The NPM, in consultation with national committees, completed the forms.
• Test developers met in September 1999 to (i) identify items that were totally unsuitable for the main study, and (ii) recommend the best set for the main study, ensuring a balance in all aspects specified in the frameworks. FEG chairs were also invited to join the test developers.
• The FEGs met with the test developers in October 1999 and reviewed and modified the recommended set of selected items. A final set of items was presented to the National Project Managers Meeting in November 1999.
• A Cultural Review Panel met at the end of October 1999 to ensure that the selected items were appropriate for all cultures and were unbiased for any particular group of students. Cultural appropriateness was defined to mean that assessment stimuli and items were appropriate for 15-year-old students in all participating countries, that they were drawn from a cultural milieu that, overall, was equally familiar to all students, and that the tests did not violate any national cultural values or positions. The Cultural Review Panel also considered issues of gender and appropriateness across different socio-economic groups. For the list of members of the Cultural Review Panel, see Appendix 7.

Several sources were used in the item review process after the field trial.
• Participant feedback was collected during the translation process, at the marker training meeting, and throughout each country's marking activity for the field trial. The Consortium had compiled information about the wording of items, the clarity of the Marking Guides, translation difficulties and problems with graphics and printing, which were all considered when items were revised. Furthermore, countries were asked once again to rate and comment on field trial items for: (i) curriculum relevance; (ii) interest to 15-year-olds; (iii) cultural, gender or other bias; and (iv) difficulties with translation and marking.
• A report on problematic items was prepared. The report flagged items that were easier or harder than expected; had a positive non-key point-biserial or a negative key point-biserial; showed non-ordered abilities for partial credit items; or fitted the model poorly. While some of these problems could be traced to the occasional translation or printing error, the report gave statistical information on whether there was a systematic problem with a particular item.

7 It is worth mentioning here that as data entry was carried out using KeyQuest®, many short responses were entered directly, which saved time and made it possible to capture students' raw responses.
• A marker reliability study provided information on marker agreement for coding responses to open-ended items and was used to revise items and/or Marking Guides, or to reject items altogether.
• Differential Item Functioning (DIF) analyses were carried out with regard to gender, reading competence and socio-economic status (SES). Items exhibiting large DIF were candidates for removal.
• Item attributes were monitored: test developers closely tracked the percentage of items in each category for various attributes. For example, open-ended items were limited to the percentages set by the BPC, and items covered the different aspects of each area in the proportions specified in the framework.
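As a rough illustration of the flagging statistics mentioned above, the sketch below computes a point-biserial correlation between choosing a response option and the total score. It uses a generic textbook formula, not the Consortium's actual implementation, and the zero threshold is likewise illustrative.

```python
# Point-biserial: correlation between a 0/1 choice indicator and total
# score. A sound item has a positive point-biserial for the key and
# negative (or near-zero) values for the distractors.
import statistics

def point_biserial(chose_option, total_scores):
    """Correlation between a 0/1 indicator and students' total scores."""
    n = len(total_scores)
    p = sum(chose_option) / n
    if p in (0.0, 1.0):
        return 0.0                      # no variation in the indicator
    mean_choosers = statistics.mean(
        t for c, t in zip(chose_option, total_scores) if c == 1)
    mean_all = statistics.mean(total_scores)
    sd_all = statistics.pstdev(total_scores)
    return (mean_choosers - mean_all) / sd_all * (p / (1 - p)) ** 0.5

def flag_item(key_pb, distractor_pbs, threshold=0.0):
    """Flag an item whose key or distractor statistics look inverted."""
    return key_pb <= threshold or any(pb > threshold for pb in distractor_pbs)
```

An item whose key correlates negatively with total score (or whose distractor correlates positively) is more likely to be miskeyed, mistranslated or misprinted than genuinely informative, which is why such items were flagged for inspection.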
Item Bank Software
To keep track of all the item information and to facilitate item selection, the Consortium prepared item bank software that managed all phases of the item development. The database stored text types and formats, item formats, contexts and all other framework classifications, and had a comments field for the changes made throughout the test development. After the field trial, the item bank stored statistical information such as difficulty and discrimination indices. The selection process was facilitated by a module that let test developers easily identify the characteristics of the selected item pool. Items were regularly audited to ensure that the instruments being developed satisfied the framework specifications.
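A record in such an item bank might look like the sketch below; the field names and the summary helper are hypothetical, since the report does not document the actual database schema.

```python
# Hypothetical item-bank record of the kind described above: framework
# classifications up front, statistics filled in after the field trial.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ItemRecord:
    item_id: str                  # e.g. "R239Q01"
    domain: str                   # "reading", "mathematics" or "science"
    item_format: str              # e.g. "multiple-choice"
    framework_tags: dict = field(default_factory=dict)  # text type, context, ...
    comments: list = field(default_factory=list)        # changes during development
    difficulty: Optional[float] = None       # added after the field trial
    discrimination: Optional[float] = None

def pool_summary(pool, tag):
    """Percentage of items in each category of one framework attribute."""
    counts = {}
    for item in pool:
        value = item.framework_tags.get(tag)
        counts[value] = counts.get(value, 0) + 1
    return {key: 100.0 * n / len(pool) for key, n in counts.items()}
```

A summary of this kind is what lets test developers check, at a glance, whether a candidate selection still matches the framework's target proportions.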
PISA 2000 Main Study
After items for the PISA 2000 Main Study were selected from the field trial pool, items and Marking Guides were revised where necessary, clusters and booklets were assembled, and all main study test instruments were prepared for dispatch to national centres. In particular, more extensive and detailed sample responses were added to the Marking Guides using material derived from the field trial student booklets. The collection of sample responses was designed to anticipate the range of possible responses to each item requiring expert marking, to ensure consistency in coding student responses, and therefore to maximise the reliability of marking. These materials were sent to countries on 31 December 1999.
Marker training materials were prepared using sample responses collected from field trial booklets. Double-digit coding was developed to distinguish between the score and the response code and was used for 10 mathematics and 12 science items. The first digit indicates the score and the second digit indicates the method or approach, allowing distinctions to be retained between responses that reflect quite different cognitive processes and knowledge. For example, whether an algebraic or a trial-and-error approach was used to arrive at a correct answer, the student could score a '1' for the item, and the method used would be reflected in the second digit.
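The scheme can be illustrated in a few lines of code (the specific code values below are invented for illustration; the report only gives the algebraic versus trial-and-error example):

```python
# Double-digit codes: the first digit carries the score, the second the
# method or approach. Code values here are illustrative only.
def split_code(code):
    """Split a two-digit marker code into (score, method) digits."""
    score, method = divmod(code, 10)
    return score, method

# e.g. codes 11 and 12 both score 1 but record different solution methods:
score_a, method_a = split_code(11)   # say, an algebraic approach
score_b, method_b = split_code(12)   # say, a trial-and-error approach
```

Because the first digit alone determines the score, analyses of achievement can ignore the second digit, while studies of solution strategies can recover it.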
Marker training meetings took place in February 2000. Members of the test development team conducted separate marker training sessions for each of the three test domains.

As in the field trial, a marker query service was established so that any queries from marking centres in each country could be directed to the relevant Consortium test developer, and consistent advice could be provided quickly.

Finally, when cleaning and analysis of data from the main study had progressed sufficiently to permit item analysis on all items and the generation of item-by-country interaction data, the test developers provided advice on items with poor measurement properties and on items where the interactions were potentially problematic. This advice was one of the inputs to decisions about item deletion at either the international or national level (see Chapter 13).
PISA 2000 Test Characteristics
An important goal in building the PISA 2000 tests was to ensure that the final form of the tests met the specifications given in the PISA 2000 framework (OECD, 1999a). For reading, this meant that an appropriate distribution across Text Structure, Type of Task (Process), Text Type, Context and Item Type was required. The attained distributions, which closely matched the goals specified in the framework, are shown by framework category and item type8 in Tables 2, 3, 4 and 5.

Table 2: Distribution of Reading Items by Text Structure and by Item Type

Text Structure | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Continuous | 89 (2) | 42 | 3 | 3 | 34 (2) | 7
Non-continuous | 52 (7) | 14 (2) | 4 (1) | 12 | 9 (1) | 13 (3)
Total | 141 (9) | 56 (2) | 7 (1) | 15 | 43 (3) | 20 (3)

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.

Table 3: Distribution of Reading Items by Type of Task (Process) and by Item Type

Type of Task (Process) | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Interpreting | 70 (5) | 43 (2) | 3 (1) | 5 | 14 (2) | 5
Reflecting | 29 | 3 | 2 | - | 23 | 1
Retrieving Information | 42 (4) | 10 | 2 | 10 | 6 (1) | 14 (3)
Total | 141 (9) | 56 (2) | 7 (1) | 15 | 43 (3) | 20 (3)

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 4: Distribution of Reading Items by Text Type and by Item Type

Text Type | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Advertisements | 4 (3) | - | - | - | 1 | 3 (3)
Argumentative/Persuasive | 18 (1) | 7 | 1 | 2 | 8 (1) | -
Charts/Graphs | 16 (1) | 8 (1) | - | 2 | 3 | 3
Descriptive | 13 (1) | 7 | 1 | - | 4 (1) | 1
Expository | 31 | 17 | 1 | - | 9 | 4
Forms | 8 | 1 | 1 | 4 | 1 | 1
Injunctive | 9 | 3 | - | 1 | 5 | -
Maps | 4 | 1 | - | - | 1 | 2
Narrative | 18 | 8 | - | - | 8 | 2
Schematics | 5 | 2 | 2 | - | - | 1
Tables | 15 (3) | 2 (1) | 1 (1) | 6 | 3 (1) | 3
Total | 141 (9) | 56 (2) | 7 (1) | 15 | 43 (3) | 20 (3)

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
8 Item types are explained in the section "Test Scope and Test Format".
Table 6: Items from IALS

Unit Name | Unit and Item ID
Allergies Explorers | R239Q01
Allergies Explorers | R239Q02
Bicycle | R238Q02
Bicycle | R238Q01
Contact Employer | R246Q01
Contact Employer | R246Q02
Hiring Interview | R237Q01
Hiring Interview | R237Q03
Movie Reviews | R245Q01
Movie Reviews | R245Q02
New Rules | R236Q02
New Rules | R236Q01
Personnel | R234Q02
Personnel | R234Q01
Warranty Hot Point | R241Q02
Warranty Hot Point | R241Q01
Table 5: Distribution of Reading Items by Context and by Item Type

Text Context | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Educational | 39 (3) | 22 | 4 | 1 | 4 | 8 (3)
Occupational | 22 | 4 | 1 | 4 | 9 | 4
Personal | 26 | 10 | - | 3 | 10 | 3
Public | 54 (6) | 20 (2) | 2 (1) | 7 | 20 (3) | 5
Total | 141 (9) | 56 (2) | 7 (1) | 15 | 43 (3) | 20 (3)

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
For reading, it was also determined that a link to the IALS study should be explored. To make this possible, 16 IALS items were included in the PISA 2000 item pool. The IALS link items are listed in Table 6.

For mathematics, an appropriate distribution across Overarching Concepts (main mathematical themes), Competency Class, Mathematical Content Strands, Context and Item Type was required. The attained distributions, which closely matched the goals specified in the framework, are shown by framework category and item type in Tables 7, 8, 9 and 10.
Table 7: Distribution of Mathematics Items by Overarching Concepts and by Item Type

Overarching Concepts | Number of Items | Multiple-Choice | Closed Constructed-Response | Open Constructed-Response
Growth and Change | 18 (1) | 6 (1) | 9 | 3
Space and Shape | 14 | 5 | 9 | -
Total | 32 (1) | 11 (1) | 18 | 3

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 8: Distribution of Mathematics Items by Competency Class and by Item Type

Competency Class | Number of Items | Multiple-Choice | Closed Constructed-Response | Open Constructed-Response
Class 1 | 10 | 4 | 6 | -
Class 2 | 20 (1) | 7 (1) | 11 | 2
Class 3 | 2 | - | 1 | 1
Total | 32 (1) | 11 (1) | 18 | 3

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 9: Distribution of Mathematics Items by Mathematical Content Strands and by Item Type

Mathematical Content Strands | Number of Items | Multiple-Choice | Closed Constructed-Response | Open Constructed-Response
Algebra | 5 | - | 4 | 1
Functions | 5 | 4 | - | 1
Geometry | 8 | 3 | 5 | -
Measurement | 7 (1) | 3 (1) | 4 | -
Number | 1 | - | 1 | -
Statistics | 6 | 1 | 4 | 1
Total | 32 (1) | 11 (1) | 18 | 3

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 10: Distribution of Mathematics Items by Context and by Item Type

Context | Number of Items | Multiple-Choice | Closed Constructed-Response | Open Constructed-Response
Community | 4 | - | 2 | 2
Educational | 6 | 2 | 3 | 1
Occupational | 3 | 1 | 2 | -
Personal | 12 (1) | 6 (1) | 6 | -
Scientific | 7 | 2 | 5 | -
Total | 32 (1) | 11 (1) | 18 | 3

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
For science, an appropriate distribution across Processes, Major Area, Content Application, Context and Item Type was required. The attained distributions, which closely matched the goals specified in the framework, are shown by framework category and item type in Tables 11, 12, 13 and 14.
Table 11: Distribution of Science Items by Science Processes and by Item Type

Science Processes | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Communicating to Others Valid Conclusions from Evidence and Data | 3 | - | - | - | 3 | -
Demonstrating Understanding of Scientific Knowledge | 15 | 9 | 1 | - | 3 | 2
Drawing and Evaluating Conclusions | 7 (1) | 1 | 2 (1) | 1 | 3 | -
Identifying Evidence and Data | 5 | 2 | 1 | - | 2 | -
Recognising Questions | 5 | 1 | 3 | - | 1 | -
Total | 35 (1) | 13 | 7 (1) | 1 | 12 | 2

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 12: Distribution of Science Items by Science Major Area and by Item Type

Science Major Area | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Earth and Environment | 13 | 3 | 2 | 1 | 6 | 1
Life and Health | 13 | 6 | 1 | - | 5 | 1
Technology | 9 (1) | 4 | 4 (1) | - | 1 | -
Total | 35 (1) | 13 | 7 (1) | 1 | 12 | 2

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 13: Distribution of Science Items by Science Content Application and by Item Type

Science Content Application | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Atmospheric Change | 5 | - | 1 | 1 | 3 | -
Biodiversity | 1 | 1 | - | - | - | -
Chemical and Physical Change | 1 | - | - | - | 1 | -
Earth and Universe | 5 | 3 | 1 | - | - | 1
Ecosystems | 3 | 2 | - | - | 1 | -
Energy Transfer | 4 (1) | - | 2 (1) | - | 2 | -
Form and Function | 3 | 1 | - | - | 2 | -
Genetic Control | 2 | 1 | 1 | - | - | -
Geological Change | 1 | - | - | - | 1 | -
Human Biology | 3 | 1 | - | - | 2 | -
Physiological Change | 1 | - | - | - | - | 1
Structure of Matter | 6 | 4 | 2 | - | - | -
Total | 35 (1) | 13 | 7 (1) | 1 | 12 | 2

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Table 14: Distribution of Science Items by Context and by Item Type

Context | Number of Items | Multiple-Choice | Complex Multiple-Choice | Closed Constructed-Response | Open Constructed-Response | Short Response
Global | 16 | 4 | 3 | 1 | 7 | 1
Historical | 4 | 2 | - | - | 2 | -
Personal | 8 | 4 | 2 | - | 2 | -
Public | 7 (1) | 3 | 2 (1) | - | 1 | 1
Total | 35 (1) | 13 | 7 (1) | 1 | 12 | 2

Note: The numbers in parentheses indicate the number of items deleted after the main study analysis.
Chapter 3

Student and School Questionnaire Development
Adrian Harvey-Beavis
Overview
A Student and a School Questionnaire were used in PISA 2000 to collect data that could be used in constructing indicators pointing to social, cultural, economic and educational factors that are thought to influence, or to be associated with, student achievement. PISA 2000 did not include a teacher questionnaire.
Using the data taken from these two questionnaires, analyses linking context information with student achievement could address differences:
• between countries in the relationships between student-level factors (such as gender and social background) and achievement;
• between countries in the relationships between school-level factors and achievement;
• in the proportion of variation in achievement between (rather than within) schools, and differences in this value across countries;
• between countries in the extent to which schools moderate or increase the effects of individual-level student factors on student achievement;
• in education systems and national contexts that are related to differences in student achievement among countries; and
• in any or all of these relationships over time.

In May 1997, the OECD Member countries, through the Board of Participating Countries (BPC), established a set of priorities and their relative individual importance to guide the development of the PISA context questionnaires. These were written into the project's terms of reference. For the Student Questionnaire, this included the following priorities, in descending order:
• basic demographics (date of birth and gender);
• global measures of socio-economic status (SES);
• student description of school/instructional processes;
• student attitudes towards reading and reading habits;
• student access to educational resources outside school (e.g., books in the home, a place to study);
• institutional patterns of participation and programme orientation (e.g., the student's educational pathway and current track/stream);
• language spoken in the home;
• nationality (e.g., country of birth of student and parents, time of immigration); and
• student expectations (e.g., career and educational plans).

The Member countries considered that all of these priorities needed to be measured in each cycle of PISA. Two other measures, an in-depth measure of students' socio-economic status and opportunity to learn, were also considered important and were selected for inclusion as focus topics for PISA 2000.

The BPC requested that the Student Questionnaire be limited in length to approximately 30 minutes.
For the School Questionnaire, the BPC established the following set of priorities, in descending order of importance:
• quality of the school's human and material resources (e.g., student-teacher ratio, availability of laboratories, quality of teachers, teacher qualifications);
• global measures of school-level SES (student composition);
• school-level variables on instructional context;
• institutional structure/type;
• urbanisation/community type;
• school size;
• parental involvement; and
• public/private control and funding.

The BPC considered that these priorities needed to be covered in each cycle of PISA. Four other measures considered important and selected for inclusion as focus topics for the PISA 2000 School Questionnaire were listed in order of priority:
• school policies and practices;
• school socio-economic status and quality of human and material resources (same ranking as staffing patterns);
• staffing patterns, teacher qualifications and professional development; and
• centralisation/decentralisation of decision-making and school autonomy.

The BPC requested that the School Questionnaire be limited in length to approximately 20 minutes.
Questions were selected or developed for inclusion in the PISA context questionnaires from the set of BPC priorities. The choice of questions, and the development of new ones, was guided by a questionnaire framework document, which evolved over the period of the first cycle of PISA.1 This document indicated what PISA could or could not do because of its design limitations, and provided a way to understand key concepts used in the questionnaires. As the project evolved, the reporting needs came to play an increasingly important role in defining and refining BPC priorities. In particular, the needs of the initial report and the proposed thematic reports helped to clarify the objectives of the context questionnaires.
Development Process
The Questions
The first cycle of the PISA data collection could not always benefit from previous PISA instrumentation to guide the development of the context questionnaires. The approach therefore was to consider each BPC priority and then scan instruments used in other studies, particularly international ones, for candidate questions. Experts and Consortium members also developed questions. A pool of questions was developed and evaluated by experts on the basis of:
• how well questions would measure the dimensions they purported to tap;
• breadth of coverage of the concept;
• ease of translation;
• anticipated ease of use for the respondents;
• cultural appropriateness; and
• face validity.
National centres were consulted extensively before the field trial to ensure that a maximum number of these criteria were met. Further review was undertaken after the consultation, and a set of questions was decided upon for the field trial and reviewed for how well it covered BPC priorities. In order to trial as many questionnaire items as possible, three versions of the Student Questionnaire were used in the field trial, but only one version of the School Questionnaire. This was due partly to the small number of field trial schools and partly to the fact that the domains to be covered were less challenging than some in the Student Questionnaire.
While the question pool was being developed, the questionnaire framework was being revised. It was reviewed and developed on an ongoing basis to provide guidance on the theoretical and practical limits of the questionnaires.
To obtain further information about students' responses to the field trial version of the questionnaire, School Quality Monitors (see Chapter 7) interviewed students after they had completed it. This feedback was incorporated into an extensive review process after the field trial. For the main study, further review and consultation was undertaken involving the analysis of data captured during the field trial, which gave an indication of:
• the distribution of values, including levels of missing data;
• how well variables hypothesised to be associated with achievement correlated with, in particular, reading achievement;
• how well scales had functioned; and
• whether there had been problems with specific questions within individual countries.

Experts and national centres were again consulted in preparation for the main study.

1 The questionnaire framework was not published by the OECD but is available as a project working document.
The Framework
The theoretical work guiding the development of the context questionnaires focused on policy-related issues.
The grid shown in Figure 3 is drawn from the work of Travers and Westbury (1989) [see also Robitaille and Garden (1989)] and was used to assist in planning the coverage of the PISA 2000 questionnaires. The factors that influence student learning in any given country are considered to be contextually shaped. Context is the set of circumstances in which a student learns, and these circumstances have antecedents that define them in fundamental ways. While antecedents emerge from historical processes and developments, PISA views them primarily in terms of existing institutional and social factors.

A context finds expression in various ways: it has a content. Content was seen in terms of curriculum for the Third International Mathematics and Science Study (TIMSS).
However, PISA is based on a dynamic model of lifelong learning in which new knowledge and skills necessary for successful adaptation to a changing world are continuously acquired throughout life. It aims at measuring how well students are likely to perform beyond the school curriculum and therefore conceives of content much more broadly than other international studies.
The antecedents, context and content represent one dimension of the framework depicted in Figure 3. Each of these finds expression at a different level: the educational system, the school, the classroom and the student. These four levels constitute the second dimension of the grid. The bold text in the matrix represents the core elements that PISA could cover given its design. The design permits data to be collected for seven of the 12 cells depicted in Figure 3 and marked in bold. The PISA design best covers the school and student levels but does not permit collecting much information about the class level. Nor does PISA collect information at the system level directly, but this can be derived from other sources or by aggregating some PISA variables at the school or country level.
Questionnaires were developed through a process of review, evaluation and consultation, and further refined by examining field trial data, input from interviews with students, and development of the questionnaire framework.
Level | Antecedents | Context | Content
System | 1. Country features and policies | 2. Institutional settings | 3. Intended schooling outcomes
School | 4. Community and school characteristics | 5. School conditions and processes | 6. Implemented curriculum
Class | 7. Teacher background characteristics | 8. Class conditions and processes | 9. Implemented curriculum
Student | 10. Student background characteristics | 11. Student classroom behaviours | 12. Attained schooling outcomes

Figure 3: Mapping the Coverage of the PISA 2000 Questionnaires
Coverage
The PISA context questionnaires covered all BPC priorities with the exception of opportunity to learn, which the BPC subsequently removed as a priority.
Student Questionnaire
Basic demographics
Basic demographics information collected included:date of birth; grade at school; gender; familystructure; number of siblings; and birth order.
Family background and measures of socio-economic status
The family background variables included:parental engagement in the workforce; parentaloccupation (which was later transformed intothe PISA International Socio-Economic Index ofOccupational Status, ISEI); parental education;country of birth of students and parents;language spoken at home; social and culturalcommunication with parents; family educationalsupport; activities related to classical culture;family wealth; home educational resources; useof school resources; and home possessionsrelated to classical culture.
Student description of school/instructionalprocesses
For the main study, data were collected on classsize; disciplinary climate; teacher-studentrelations; achievement press; teacher support;frequency of homework; and time spent onhomework. This group of variables also coveredinstruction time; sense of belonging; teachers’use of homework; and whether a studentattended a remedial class, and provided data onstudents’ marks in mathematics, science and thestudents’ main language class3.
Student attitudes towards reading and reading habits
Attitudes to reading and reading habits included frequency of borrowing books; reading diversity; time spent in reading; engagement in reading; and number of books in the home.
Student access to educational resources outside school

Students were asked about different types of educational resources available in the home, including books, computers and a place to study.
Institutional patterns of participation and programme orientation

The programme in which the student was enrolled was identified within the Student Questionnaire and coded into the International Standard Classification of Education (ISCED) (OECD, 1999b), making the data comparable across countries.
Student career and educational expectations

Students were asked about their career and educational plans in terms of job expectations and highest education level expected.
School Questionnaire
Basic school characteristics
Administrators were asked about school location; school size; hours of schooling; school type (public or private control and funding); institutional structure; and type of school (including grade levels, admission and transfer policies; length of the school year in weeks and class periods; and available programmes).
School policies and practices

This area included the centralisation or decentralisation of school and teacher autonomy and decision-making; policy practices concerning assessment methods and their use; and the involvement of parents.

School climate

Issues addressed under this priority area included perception of teacher- and student-related factors affecting school climate; and perception of teachers' morale and commitment.

School resources

This area included the quality of the schools' educational resources and physical infrastructure, including student-teaching staff ratio; shortage of teachers; and availability of computers.
3 Because of difficulties associated with operationalising the question on 'marks', this topic was an international option.
Cross-National Appropriateness of Items
The cross-national appropriateness of items was an important design criterion for the questionnaires. Data collected within any one country had to allow valid comparisons with other countries. National committees, working through NPMs, reviewed questions to minimise misunderstandings or likely misinterpretations arising from differences in educational systems, cultures or other social factors.
To further facilitate cross-national appropriateness, PISA used typologies designed for use in collecting international data; namely, the International Standard Classification of Education (ISCED) and the International Standard Classification of Occupations (ISCO). Once occupations had been coded into ISCO, the codes were re-coded into the International Socio-Economic Index of Occupational Status (ISEI), which provides a measure of the socio-economic status of occupations in all countries participating in PISA (see Chapter 17).
Cross-curricular Competencies and Information Technology
In addition to these two context questionnaires developed by the Consortium, there were two other instruments, available as international options,4 that were developed by OECD experts or consultants.
The Cross-curricular Competencies Instrument
Early in the development of PISA, the importance of collecting information on a variety of cross-curricular skills, knowledge and attitudes considered to be central goals in most school systems became apparent. While these skills, knowledge and attitudes may not be part of the formal curriculum, they can be acquired through multiple, indirect learning experiences inside and outside of school. From 1993 to 1996,
the Member countries conducted a feasibility study on four cross-curriculum domains: Knowledge of Politics, Economics and Society (Civics); Problem-Solving; Self-Perception/Self-Concept; and Communication Skills.
Nine countries participated in the pilot: Austria, Belgium (Flemish and French Communities), Hungary, Italy, the Netherlands, Norway, Switzerland and the United States. The study was co-ordinated by the Department of Sociology at the University of Groningen, on behalf of the OECD. The results (published in the OECD brochure Prepared for Life (Peschar, 1997)) indicated that the Problem-Solving and the Communication Skills instruments required considerable investment in developmental work before they could be used in an international assessment. However, both the Civics and the Self-Perception instruments showed promising psychometric statistics and encouraging stability across countries.
The INES General Assembly, the Education Committee and the Centre for Educational Research and Innovation (CERI) Governing Board members were interested in the feasibility study and encouraged the Member countries to include a Cross-Curriculum Competency (CCC) component in the Strategic Plan that they were developing. The component retained for the first cycle of PISA was self-concept, since a study on civic education was being conducted at the same time by the IEA (International Association for the Evaluation of Educational Achievement), and it was decided to avoid possible duplication in this area. The CCC component was not included in the PISA General Terms of Reference, and the OECD subcontracted a separate agency to conduct this part of the study.
From the various aspects of self-concept included in the pilot study, the first cycle of PISA used an instrument that provided a self-report on self-regulated learning, based on the following sets of constructs:
• strategies of self-regulated learning, which regulate how deeply and systematically information will be processed;
• motivational preferences and goal orientations, which regulate the investment of time and mental energy for learning purposes and influence the choice of learning strategies;
• self-regulated cognition mechanisms, which regulate the standards, aims, and processes of action in relation to learning;
• action control strategies, particularly effort and persistence, which prevent the learner from being distracted by competing intentions and help to overcome learning difficulties; and
• preferences for different types of learning situations, learning styles and social skills that are required for co-operative learning.

A total of 52 items was used in PISA 2000 to tap these five main dimensions.

A rationale for the self-regulated learning instrument and the items used is presented in DEELSA/PISA/BPC (98)25. This document focuses on the dynamic model of continuous lifelong learning that underlies PISA, which it describes as:

'new knowledge and skills necessary for successful adaptation to changing circumstances are continuously acquired over the life span. While students cannot learn everything they will need to know in adulthood, they need to acquire the prerequisites for successful learning in future life. These prerequisites are both of a cognitive and a motivational nature. This depends on individuals having the ability to organise and regulate their own learning, including being prepared to learn independently and in groups, and the ability to overcome difficulties in the learning process. Moreover, further learning and the acquisition of additional knowledge will increasingly occur in situations in which people work together and are dependent on one another. Therefore, socio-cognitive and social competencies will have a directly supportive function.'

4 That is, countries could use these instruments, the Consortium would support their use, and the data collected would be included in the international data set.
The Information Technology or Computer Familiarity Instrument
The PISA 2000 Information Technology (IT) or Computer Familiarity instrument consisted of 10 questions, some of which had several parts. The first six were taken from an instrument developed by the Educational Testing Service (ETS) in the United States. ETS developed a 23-item questionnaire in response to concerns that the Test of English as a Foreign Language (TOEFL), taken mostly by students from other countries as part of applying to study in the United States, may have been producing biased results when administered on a computer because of different levels of computer familiarity (see Eignor et al., 1998). Since the instrument proved to be short and efficient, it was suggested that it be added to PISA. Network A and the BPC approved the idea.
Of the 23 items in the ETS instrument, 14 were used in the PISA 2000 questionnaire, some of which were grouped or slightly reworked. For example, the ETS instrument asks 'How would you rate your ability to use a computer?' and the PISA questionnaire asks 'If you compare yourself with other 15-year-olds, how would you rate your ability to use a computer?'
The last four questions in the PISA questionnaire (ITQ7 to ITQ10) were added to provide a measure of students' interest in and enjoyment of computers, which are not covered in the ETS questionnaire.
Section Two: Operations

Chapter 4

Sample Design
Sheila Krawchuk and Keith Rust

Target Population and Overview of the Sampling Design
The desired base PISA target population in each country consisted of 15-year-old students attending educational institutions located within the country. This meant that countries were to include: (i) 15-year-olds enrolled full-time in educational institutions; (ii) 15-year-olds enrolled in educational institutions who attended on only a part-time basis; (iii) students in vocational training types of programmes, or any other related type of educational programmes; and (iv) students attending foreign schools within the country (as well as students from other countries attending any of the programmes in the first three categories). It was recognised that no testing of persons schooled in the home, in the workplace or out of the country would occur, and therefore these students were not included in the International Target Population.
The operational definition of an age population depends directly on the testing dates. The international requirement was that the assessment had to be conducted during a 42-day period1 between 1 March 2000 and 31 October 2000. Further, testing was not permitted during the first three months of the school year because of a concern that student performance levels may be lower at the beginning of the academic year than at the end of the previous academic year.
The 15-year-old International Target Population was slightly adapted to better fit the age structure of most of the Northern Hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students aged from 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period. This meant that in all countries testing in April 2000, the national target population could have been defined as all students born in 1984 who were attending a school or other educational institution.
Further, a variation of up to one month in this age definition was permitted. For instance, a country testing in March or in May was still allowed to define the national target population as all students born in 1984. If the testing was to take place at another time, the birth date definition had to be adjusted and approved by the Consortium.
The sampling design used for the PISA assessment was a two-stage stratified sample in most countries. The first-stage sampling units consisted of individual schools having 15-year-old students. In all but a few countries, schools were sampled systematically from a comprehensive national list of all eligible schools, with probabilities proportional to a measure of size.2 The measure of size was a function of the estimated number of eligible 15-year-old students enrolled. Prior to sampling, schools in the sampling frame were assigned to strata formed either explicitly or implicitly.

1 Referred to as the testing window.
2 Referred to as Probability Proportional to Size (or PPS) sampling.
The second-stage sampling units in countries using the two-stage design were students within sampled schools. Once schools were selected to be in the sample, a list of each sampled school's 15-year-old students was prepared. From each list that contained more than 35 students, 35 students were selected with equal probability, and for lists of fewer than 35, all students on the list were selected.
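The two-stage design described above can be sketched in code. This is an illustrative sketch only, not the operational PISA sampling software; the function names, data layout and use of Python's random module are assumptions.

```python
import random

# Illustrative sketch of the two-stage design: systematic PPS selection of
# schools, then an equal-probability sample of students within each school.

def sample_schools_pps(schools, n_schools, seed=1):
    """First stage: systematic PPS sampling. `schools` is a list of
    (school_id, enr) pairs, where enr is the estimated number of eligible
    15-year-olds (the measure of size)."""
    rng = random.Random(seed)
    total = sum(enr for _, enr in schools)
    interval = total / n_schools                # sampling interval
    start = rng.uniform(0, interval)            # random start
    points = [start + i * interval for i in range(n_schools)]
    sampled, cumulative, i = [], 0.0, 0
    for school_id, enr in schools:
        cumulative += enr
        while i < len(points) and points[i] < cumulative:
            sampled.append(school_id)           # selection probability ~ enr
            i += 1
    return sampled

def sample_students(student_list, n=35, seed=1):
    """Second stage: an equal-probability sample of 35 students, or all
    students if the school lists 35 or fewer."""
    if len(student_list) <= n:
        return list(student_list)
    return random.Random(seed).sample(student_list, n)
```

Sorting the frame before the systematic pass is what makes the implicit stratification described later in this chapter effective.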
In three countries, a three-stage design was used. In such cases, geographical areas were sampled first (called first-stage units) using probability proportional to size sampling, and then schools (called second-stage units) were selected within sampled areas. Students were the third-stage sampling units in three-stage designs.
Population Coverage, and School and Student Participation Rate Standards
To provide valid estimates of student achievement, the sample of students had to be selected using established and professionally recognised principles of scientific sampling, in a way that ensured representation of the full target population of 15-year-old students.

Furthermore, quality standards had to be maintained with respect to (i) the coverage of the international target population, (ii) accuracy and precision, and (iii) the school and student response rates.
Coverage of the International PISA Target Population
In an international survey in education, the types of exclusion must be defined internationally and the exclusion rates have to be limited. Indeed, if a significant proportion of students were excluded, the survey results would not be deemed representative of the entire national school system. Thus, efforts were made to ensure that exclusions, where necessary, were minimised.

Exclusion can take place at the school level (the whole school is excluded) or at the within-school level. In PISA, there are several reasons why a school or a student can be excluded.
Exclusions at school level might result from removing a small, remote geographical region due to inaccessibility or size, or from removing a language group, possibly for political, organisational or operational reasons. Areas deemed by the Board of Participating Countries to be part of a country (for the purposes of PISA), but which were not included for sampling, were designated as non-covered areas and documented as such, although this occurred infrequently. Care was taken in this regard because, when such situations did occur, the national desired target population differed from the international desired target population.
International within-school exclusion rules for students were specified as follows:
• Educable mentally retarded students are students who are considered, in the professional opinion of the school principal or of other qualified staff members, to be educable mentally retarded, or who have been tested psychologically as such. This category includes students who are emotionally or mentally unable to follow even the general instructions of the test. Students were not to be excluded solely because of poor academic performance or normal discipline problems.
• Functionally disabled students are students who are permanently physically disabled in such a way that they cannot perform in the PISA testing situation. Functionally disabled students who could respond were to be included in the testing.
• Non-native language speakers: only students who had received less than one year of instruction in the language(s) of the test were to be excluded.

A school attended only by students who would be excluded for mental, functional or linguistic reasons was considered as a school exclusion.

It was required that the overall exclusion rate within a country be kept below 5 per cent. Restrictions on the levels of exclusions of various types were as follows:
• School-level exclusions for inaccessibility, size, feasibility or other reasons were required to cover fewer than 0.5 per cent of the total number of students in the International PISA Target Population.
• School-level exclusions for educable mentally retarded students, functionally disabled students or non-native language speakers were required to cover fewer than 2 per cent of students.
• Within-school exclusions for educable mentally retarded students, functionally disabled students or non-native language speakers were required to cover fewer than 2.5 per cent of students.
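As a sketch, the exclusion limits listed above can be checked with a small helper. The function and its argument names are hypothetical; rates are expressed as fractions of the International PISA Target Population.

```python
# Hypothetical helper checking a country's exclusion counts against the
# limits stated above: school-level exclusions for inaccessibility/size
# (< 0.5%), school-level exclusions for special-needs or language reasons
# (< 2%), within-school exclusions (< 2.5%), and overall (< 5%).

def check_exclusion_limits(pop_size, school_excl_access, school_excl_sen,
                           within_school_excl):
    """Return, per exclusion type, the rate and whether it is within limit."""
    limits = {
        "school_access": (school_excl_access / pop_size, 0.005),
        "school_sen": (school_excl_sen / pop_size, 0.02),
        "within_school": (within_school_excl / pop_size, 0.025),
    }
    result = {k: (rate, rate < limit) for k, (rate, limit) in limits.items()}
    overall = sum(rate for rate, _ in limits.values())
    result["overall"] = (overall, overall < 0.05)
    return result
```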
Accuracy and Precision
A minimum of 150 schools (or all schools if there were fewer than 150 schools in a participating jurisdiction) had to be selected in each country. Within each participating school, 35 students were randomly selected with equal probability, or, in schools with fewer than 35 eligible students, all students were selected. In total, a minimum sample size of 4 500 assessed students was to be achieved. It was possible to vary the number of students selected per school, but if fewer than 35 students per school were to be selected, then: (i) the sample size of schools was increased beyond 150, so as to ensure that at least 4 500 students were sampled; and (ii) the number of students selected per school had to be at least 20, so as to ensure adequate accuracy in estimating variance components within and between schools, an analytical objective of PISA.
National Project Managers were strongly encouraged to identify stratification variables to reduce the sampling variance.
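The school sample size rules above can be illustrated with a small calculation, under the simplifying (assumed) premise that every sampled school yields the full per-school student sample; the function name is illustrative.

```python
import math

# Sketch of the sample-size rules: at least 150 schools, at least 4 500
# assessed students in total, and at least 20 students per school.

def schools_required(students_per_school):
    """Minimum number of schools for a given per-school sample size."""
    if students_per_school < 20:
        raise ValueError("at least 20 students per school were required")
    return max(150, math.ceil(4500 / students_per_school))
```

With the default 35 students per school, 150 schools already exceed the 4 500-student target; reducing the per-school sample to the minimum of 20 pushes the school sample up to 225.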
School Response Rates
A response rate of 85 per cent was required for initially selected schools. If the initial school response rate fell between 65 and 85 per cent, an acceptable school response rate could still be achieved through the use of replacement schools. To compensate for a sampled school that did not participate, where possible two replacement schools were identified for each sampled school. Furthermore, schools with a student participation rate between 25 and 50 per cent were not considered as participating schools for the purposes of calculating and documenting response rates. However, data from such schools were included in the database and contributed to the estimates included in the initial PISA international report. Data from schools with a student participation rate of less than 25 per cent were not included in the database.
The rationale for this approach was as follows. There was concern that, in an effort to meet the requirements for school response rates, a national centre might accept participation from schools that would not make a concerted effort to have students attend the assessment sessions. To avoid this, a standard for student participation was required for each individual school in order for the school to be regarded as a participant. This standard was set at 50 per cent. However, there were a few schools in many countries that conducted the assessment without meeting that standard. Thus a post-hoc judgement was needed to decide whether the data from students in such schools should be used in the analyses, given that the students had already been assessed. If the students from such schools were retained, non-response bias would be introduced to the extent that the students who were absent differed in achievement from those who attended the testing session, and such a bias is magnified by the relative sizes of these two groups. If one chose to delete all assessment data from such schools, then non-response bias would be introduced to the extent that the school differed from others in the sample, and sampling variance would be increased because of sample size attrition.

The judgement was made that, for a school with between 25 and 50 per cent student response, the latter source of bias and variance was likely to introduce more error into the study estimates than the former, with the converse judgement for schools with a student response rate below 25 per cent. Clearly the cut-off of 25 per cent is an arbitrary one, as extensive studies would be needed to establish this cut-off empirically. However, it is clear that, as the student response rate decreases within a school, the bias from using the assessed students in that school will increase, while the loss in sample size from dropping all of the students in the school will rapidly decrease.
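The treatment of a school's data according to its student participation rate, as described above, amounts to a small decision rule. The function name and string labels below are illustrative, not PISA terminology.

```python
# Sketch of the rules above: a school counts as a participant at a student
# response rate of 50 per cent or more; between 25 and 50 per cent its data
# are kept in the database but it is not counted as participating; below
# 25 per cent its data are dropped.

def classify_school(student_response_rate):
    """Classify a sampled school by its student response rate (a fraction)."""
    if student_response_rate >= 0.50:
        return "participant"          # counts towards school response rates
    elif student_response_rate >= 0.25:
        return "data retained only"   # in database, not a participant
    else:
        return "data excluded"        # dropped from the database
```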
Figure 4 provides a summary of the international requirements for school response rates.
[Figure 4: School Response Rate Standards - Criteria for Acceptability. The figure plots the school response rate before replacements (horizontal axis) against the response rate after replacements (vertical axis), marking regions labelled 'Acceptable', 'Replacement successful', 'Tried replacement but without success' and 'Not Acceptable'.]

These PISA standards applied to weighted school response rates. The procedures for calculating weighted response rates are presented in Chapter 8. Weighted response rates weight each school by the number of students in the population that are represented by the students sampled from within that school. The weight consists primarily of the enrolment size of 15-year-old students in the school, divided by the selection probability of the school. Because the school samples were in general selected with probability proportional to size, in most countries most schools contributed equal weights, so that weighted and unweighted school response rates were very similar. Exceptions could occur in countries that had explicit strata that were sampled at very different rates. However, in no case did a country differ substantially with regard to the response rate standard when unweighted rates were used compared to weighted rates.

Details as to how the PISA participants performed relative to these school response rate standards are included in the quality indicators section of Chapter 15.

Student Response Rates

A response rate of 80 per cent of selected students in participating schools was required. A student who had participated in the first part of the testing session was considered to be a participant. A student response rate of 50 per cent was required for a school to be regarded as participating: the student response rate was computed using only students from schools with at least a 50 per cent student response rate. Again, weighted student response rates were used for assessing this standard. Each student was weighted by the reciprocal of his/her sample selection probability. Weighted and unweighted rates differed little within any country.3

3 After students from schools with below 50 per cent student response were removed from the calculations.

Main Study School Sample

Definition of the National Target Population

National Project Managers (NPMs) were first required to confirm their dates of testing and age definition with the PISA Consortium. Once these were approved, NPMs were alerted to avoid having possible drift in the assessment period lead to an unapproved definition of the national target population.

Every NPM was required to define and describe his/her country's national desired target population and explain how and why it might deviate from the international target population. Any hardships in accomplishing complete coverage were specified, discussed, and approved or not, in advance. Where the national desired target population deviated from full national coverage of all eligible students, the deviations were described and enrolment data were provided to measure how much coverage was reduced.
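The weighted school response rate discussed earlier in this chapter, where each school is weighted by its enrolment of 15-year-olds divided by its selection probability, can be sketched as follows. The data layout and function name are illustrative assumptions.

```python
# Sketch of a weighted school response rate: the weight of each school is
# its 15-year-old enrolment divided by its selection probability, and the
# rate is the weight of responding schools over the weight of all sampled
# eligible schools.

def weighted_school_response_rate(schools):
    """schools: list of dicts with keys 'enr' (enrolment of 15-year-olds),
    'selection_prob' and 'participated' (bool)."""
    weights = [s["enr"] / s["selection_prob"] for s in schools]
    responding = sum(w for s, w in zip(schools, weights) if s["participated"])
    eligible = sum(weights)
    return responding / eligible
```

Under PPS sampling the selection probability is roughly proportional to enrolment, so the weights are roughly equal, which is why weighted and unweighted school response rates were usually very similar.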
School-level and within-school exclusions from the national desired target population resulted in a national defined target population corresponding to the population of students recorded on each country's school sampling frame. Schools were usually excluded for practical reasons such as increased survey costs or complexity in the sample design and/or difficult test conditions. They could be excluded, depending on the percentage of 15-year-old students involved, if they were geographically inaccessible (but not part of a region omitted from the national desired target population), or extremely small, or if it was not feasible to administer the PISA assessment. These difficulties were mainly addressed by modifying the sample design to reduce the number of such schools selected rather than excluding them, and exclusions from the national desired target population were held to a minimum and were almost always below 0.5 per cent. Otherwise, countries were instructed to include the schools but to administer the PISA SE booklet,4 consisting of a subset of the easier PISA assessment items.
Within-school, or student-level, exclusions were generally expected to be less than 2.5 per cent in each country, allowing an overall level of exclusions within a country of no more than 5 per cent. Because definitions of within-school exclusions could vary from country to country, however, NPMs were asked to adapt the following rules to make them workable in their country, but still to code them according to the PISA international coding scheme.
Within participating schools, all eligible students (i.e., born within the defined time period, regardless of grade) were to be listed. From this, either a sample of 35 students was randomly selected or all students were selected if
there were fewer than 35 15-year-olds. The lists had to include sampled students deemed to meet one of the categories for exclusion, and a variable was maintained to briefly describe the reason for exclusion. This made it possible to estimate the size of the within-school exclusions from the sample data.
It was understood that the exact extent of within-school exclusions would not be known until the within-school sampling data were returned from participating schools and sampling weights were computed. Country participants' projections for within-school exclusions, provided before school sampling, were known to be estimates.
NPMs were made aware of the distinction between within-school exclusions and non-response. Students who could not take the achievement tests because of a permanent condition were to be excluded, while those with a temporary impairment at the time of testing, such as a broken arm, were treated as non-respondents along with other absent sampled students.
Exclusions by country participant are documented in Chapter 12 in Section 4.
The Sampling Frame
All NPMs were required to construct a school sampling frame to correspond to their national defined target population. This was defined by the Sampling Manual as a frame that would provide complete coverage of the national defined target population without being contaminated by incorrect or duplicate entries or entries referring to elements that were not part of the defined target population. Initially, this list was to include any school with 15-year-old students, even those that might later be excluded. The quality of the sampling frame directly affects the survey results through the schools' probabilities of selection, and therefore their weights and the final survey estimates. NPMs were therefore advised to be very careful in constructing their frames, while realising that the frame depends largely on the availability of appropriate information about schools and students.
All but three countries used school-level sampling frames as their first stage of sample selection. The Sampling Manual indicated that the quality of sampling frames for both two and three-stage designs would largely depend on the accuracy of the available approximate enrolment of 15-year-olds (ENR) for each first-stage sampling unit. A suitable ENR value was a critical component of the sampling frames, since selection probabilities were based on it for both two and three-stage designs. The best ENR for PISA would have been the number of currently enrolled 15-year-old students. Current enrolment data, however, were rarely available at the time of sampling, which meant using alternatives. Most countries used the first option from the best alternatives, given as follows:
• student enrolment in the target age category (15-year-olds) from the most recent year of data available;
• if 15-year-olds tend to be enrolled in two or more grades, and the proportions of students who are 15 in each grade are approximately known, the 15-year-old enrolment can be estimated by applying these proportions to the corresponding grade-level enrolments;
• the grade enrolment of the modal grade for 15-year-olds; or
• total student enrolment, divided by the number of grades in the school.

The Sampling Manual noted that if reasonable estimates of ENR did not exist or if the available enrolment data were too out of date, schools might have to be selected with equal probabilities. This situation did not occur for any country.

4 The SE booklet, also referred to as Booklet 0, was described in the section on test design in Chapter 2.
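The second of the ENR alternatives above, applying known proportions of 15-year-olds to grade-level enrolments, can be sketched as follows; the grades, enrolments and proportions in the example are invented for illustration.

```python
# Sketch of ENR estimation from grade-level data: multiply each grade's
# enrolment by the (approximately known) share of that grade's students
# who are 15, and sum across grades.

def estimate_enr(grade_enrolments, prop_fifteen):
    """grade_enrolments: grade -> enrolment; prop_fifteen: grade -> share
    of that grade's students who are 15 years old."""
    return sum(grade_enrolments[g] * prop_fifteen.get(g, 0.0)
               for g in grade_enrolments)
```

For example, a school with 200 students in grade 9 (60 per cent of them aged 15) and 180 in grade 10 (35 per cent aged 15) yields an ENR estimate of 183.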
Besides ENR values, NPMs were instructed that each school entry on the frame should include at minimum:
• school identification information, such as a unique numerical national identification, and contact information such as name, address and phone number; and
• coded information about the school, such as region of country, school type and extent of urbanisation, which could be used as stratification variables.5
As noted, three-stage designs and area-level sampling frames were used by three countries where a comprehensive national list of schools was not available and could not be constructed without undue burden, or where the procedures
for administering the test required that the schools be selected in geographic clusters. As a consequence, area-level sampling frames introduced an additional stage of frame creation and sampling (called the first stage of sampling) before schools were actually sampled (the second stage of sampling). Although generalities about three-stage sampling and the use of an area-level sampling frame were outlined in the Sampling Manual (for example, that there should be at least 80 first-stage units and that about half of them needed to be sampled), NPMs were also instructed in the Sampling Manual that the more detailed procedures outlined there for the general two-stage design could easily be adapted to the three-stage design. NPMs using a three-stage design were also asked to notify the Consortium, and received additional support in using an area-level sampling frame. The countries that used a three-stage design were Poland, the Russian Federation and the United States. Germany also used a three-stage design, but only in two of its explicit strata.
Stratification
Prior to sampling, schools were to be ordered, or stratified, in the sampling frame. Stratification consists of classifying schools into like groups according to some variables, referred to as stratification variables. Stratification in PISA was used for the following reasons:
• to improve the efficiency of the sample design, thereby making the survey estimates more reliable;
• to apply different sample designs, such as disproportionate sample allocations, to specific groups of schools, such as those in states, provinces, or other regions;
• to ensure that all parts of a population were included in the sample; and
• to ensure adequate representation of specific groups of the target population in the sample.

There were two types of stratification possible: explicit and implicit. Explicit stratification consists of building separate school lists, or sampling frames, according to the set of explicit stratification variables under consideration. Implicit stratification consists essentially of sorting the schools within each explicit stratum by a set of implicit stratification variables. This type of stratification is a very simple way of ensuring a strictly proportional sample allocation of schools across all implicit strata. It can also lead to improved reliability of survey estimates, provided that the implicit stratification variables being considered are correlated with PISA achievement (at the school level). Guidelines were provided on how to go about choosing stratification variables.

5 Variables used for dividing the population into mutually exclusive groups so as to improve the precision of sample-based estimates.
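Explicit and implicit stratification as described above can be sketched as follows: separate frames are built per explicit stratum, and schools are sorted within each frame by the implicit variables before systematic selection is applied. The data layout and function name are assumptions for illustration.

```python
# Sketch of stratification: group schools into separate frames by an
# explicit stratification variable, then sort each frame by the implicit
# stratification variables. Systematic PPS selection applied to the sorted
# frame then spreads the sample proportionally across implicit strata.

def stratify(schools, explicit_key, implicit_keys):
    """schools: list of dicts. Returns {explicit stratum: sorted schools}."""
    frames = {}
    for s in schools:
        frames.setdefault(s[explicit_key], []).append(s)
    for frame in frames.values():
        frame.sort(key=lambda s: tuple(s[k] for k in implicit_keys))
    return frames
```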
Table 15 provides the explicit stratification variables used by each country, as well as the number of explicit strata, and the variables and their number of levels used for implicit stratification.6

6 As countries were requested to sort the sampling frame by school size, school size was also an implicit stratification variable, though it is not listed in Table 15. A variable used for stratification purposes is not necessarily included in the PISA data files.
Table 15: Stratification Variables

Country | Explicit Stratification Variables | Number of Explicit Strata | Implicit Stratification Variables
Australia | State/Territory (8); Sector (3) | 24 | Metropolitan/Country (2)
Austria | School Type (20) | 19 | School Identification Number (State, district, school)
Belgium (Fl.) | School Type (5) | 5 | Within strata 1-3: Combinations of School Track (8); Organised by: Private/Province/State/Local Authority/Other
Belgium (Fr.) | School Size (3); Special Education | 4 | Proportion of Over-Age Students
Brazil | School Grade (grades 5-10, 5-6 only, 7-10) | 3 | Public/Private (2); Region (5) and Urbanisation; Score Range (5)
Canada | Province (10); Language (3); School Size (5) | 49 | Public/Private (2); Urban/Rural (2)
Czech Republic | School Type (6); School Size (4) | 24 | None
Denmark | School Size (3) | 3 | School Type (4); County (15)
England | School Size (2) | 2 | Very small schools: School Type (2); Region (4). Other schools: School Type (3); Exam results: LEA schools (6), Grant Maintained schools (3) or Coeducational status in Independent schools (3); Region (4)
Finland | Region (6); Urban/Rural (2) | 11 | None
France | Type of school for large schools (4); Very small schools; Moderately small schools | 6 | None
Germany | Federal State (16); School Type (7) | 73 | For Special Education and Vocational Schools: Federal State (16)
Greece | Region (10); Public/Private (2) | 20 | School Type (3)
Hungary | School Type (5) | 2 | Region (20)
Iceland | Urban/Rural (2); School Size (4) | 8 | None
Table 15 (cont.)

Country | Explicit Stratification Variables | Number of Explicit Strata | Implicit Stratification Variables
Ireland | School Size (3) | 3 | School Type (3); School Gender Composition (3)
Italy | School Type (2); School Size (3); School Programme (4) | 21 | Area (5)
Latvia | School Size (3) | 3 | Urbanisation (3); School Type (5)
Korea | School Type and Urbanisation | 10 | Administrative Units (16)
Japan | Public/Private (2); School Programme (2) | 4 | None
Liechtenstein | Public/Private (2) | 2 | None
Luxembourg | None | 1 | None
Mexico | School Size (3) | 3 | School Type for Lower Secondary (4); School Type for Upper Secondary (3)
Netherlands | None | 1 | School Track (5)
New Zealand | School Size (3) | 3 | Public/Private (2); School Gender Composition (3); Socio-Economic Status (3); Urban/Rural (2)
Northern Ireland | None | 1 | School Type (3); Exam Results: Secondary (4), Grammar (2); Region (5)
Norway | School Size/School Type (5) | 5 | None
Poland | Type of School (4) | 4 | Type of Gmina (Region) (4)
Portugal | School Size (2) | 2 | Public/Private (2); Region (7); Index of Social Development (4)
Russian Federation | Region (45); Small Schools (2) | 47 | School Programme (2); Urbanisation (6)
Scotland | Public/Private (2) | 2 | For Public Schools: Average School Achievement (5)
Spain | Community (17); Public/Private (2) | 34 | Geographic unit in some communities; population size in others
Sweden | Public (1); Private (3); Community Group for Public Schools (5); Secondary Schools (1) | 10 | Private schools: Community Group (9); Public schools: Income Quartile (4); Secondary Schools: Geographic Area (22)
Switzerland | Area and Language (11); Public/Private (2); School Programme (4); Has grade 9 or not (2); School Size (3) | 27 | National Track (9); Canton (26); Cantonal Track
United States | None | 1 | Public/Private (2); High/Low Minority (2); Private School Type (3); Primary Sampling Unit (52)

Note: The absence of brackets indicates that numerous levels existed for that stratification variable.
Treatment of small schools in stratification

In PISA, small, moderately small and very small schools were identified, and all others were considered large. A small school had an approximate enrolment of 15-year-olds (ENR) below the target cluster size (TCS = 35 in most countries) of numbers of students to be sampled from schools with large enrolments. A very small school had an ENR less than one-half the TCS, 17 or less in most countries. A moderately small school had an ENR in the range of TCS/2 to TCS. Unless they received special treatment, small schools in the sample could reduce the sample size of students for the national sample to below the desired target because the in-school sample size would fall short of expectations. A sample with many small schools could also be an administrative burden. To minimise these problems, procedures for stratifying and allocating school samples were devised for small schools in the sampling frame.
To determine what was needed (a single stratum of small schools, very small and moderately small combined; a stratum of very small schools only; or two strata, one of very small schools and one of moderately small schools), the Sampling Manual stipulated that if the percentage of students in small schools (those schools for which ENR < TCS) was:
• less than 5 per cent, and the percentage in very small schools (those for which ENR < TCS/2) was less than 1 per cent, no small school strata were needed.
• less than 5 per cent, but the percentage in very small schools was 1 per cent or more, a stratum for very small schools was needed.
• 5 per cent or more, but the percentage in very small schools was less than 1 per cent, a stratum for small schools was needed, but no special stratum for very small schools was required.
• 5 per cent or more, and the percentage in very small schools was 1 per cent or more, a stratum for very small schools was needed, and a stratum for moderately small schools was also needed.
The small school strata could be further divided into additional explicit strata on the basis of other characteristics (e.g., region). Implicit stratification could also be used.
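The four-way rule above can be expressed as a small decision function. This is an illustrative sketch only; the function name and the returned stratum labels are invented, and the two percentages are assumed to have already been computed from the sampling frame.

```python
def small_school_strata(pct_small, pct_very_small):
    """Return the small-school strata needed, following the rule above.

    pct_small: percentage of students in schools with ENR < TCS
    pct_very_small: percentage of students in schools with ENR < TCS/2
    (Function name and return labels are illustrative, not from the manual.)
    """
    if pct_small < 5 and pct_very_small < 1:
        return []                                  # no small-school strata
    if pct_small < 5:
        return ["very small"]                      # very small stratum only
    if pct_very_small < 1:
        return ["small"]                           # one combined small stratum
    return ["very small", "moderately small"]      # two separate strata
```

For example, a country with 6 per cent of students in small schools but under 1 per cent in very small schools would need a single combined small-school stratum.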
When small schools were explicitly stratified, it was important to ensure that an adequate sample was selected without selecting too many small schools, as this would lead to too few students in the assessment. In this case, the entire school sample would have to be increased to meet the target student sample size.
The sample had to be proportional to the number of students and not to the number of schools. Suppose that 10 per cent of students attend moderately small schools, 10 per cent very small schools and the remaining 80 per cent attend large schools. In the sample of 5 250, 4 200 students would be expected to come from large schools (i.e., 120 schools by 35 students), 525 students from moderately small schools and 525 students from very small schools. If moderately small schools had an average of 25 students, then it would be necessary to include 21 moderately small schools in the sample. If the average size of very small schools was 10 students, then 52 very small schools would be needed in the sample and the school sample size would be equal to 193 schools rather than 150.
To balance the two objectives of selecting an adequate sample of explicitly stratified small schools, a procedure was recommended that assumes identifying strata of both very small and moderately small schools. The underlying idea is to under-sample by a factor of two the very small school stratum and to increase proportionally the size of the large school strata. When there was just a single small school stratum, the procedure was modified by ignoring the parts concerning very small schools. The formulae below also assume a school sample size of 150 and a student sample size of 5 250.
- Step 1: From the complete sampling frame, find the proportions of total ENR that come from very small schools (P), moderately small schools (Q), and larger schools (those with ENR of at least TCS) (R). Thus P + Q + R = 1.
- Step 2: Calculate the figure L, where L = 1 - (P/2). Thus L is a positive number slightly less than 1.0.
- Step 3: The minimum sample size for larger schools is equal to 150 x (R/L), rounded to the nearest integer. It may need to be enlarged because of national considerations, such as the need to achieve minimum sample sizes for geographic regions or certain school types.
- Step 4: Calculate the mean value of ENR for moderately small schools (MENR), and for very small schools (VENR). MENR is a number in the range of TCS/2 to TCS, and VENR is a number no greater than TCS/2.
- Step 5: The number of schools that must be sampled from the stratum of moderately small schools is given by: (5 250 x Q)/(L x MENR).
- Step 6: The number of schools that must be sampled from the stratum of very small schools is given by: (2 625 x P)/(L x VENR).
To illustrate the steps, suppose that in participant country X, the TCS is equal to 35, with 0.1 of the total enrolment of 15-year-olds in moderately small schools and 0.1 in very small schools. Suppose that the average enrolment in moderately small schools is 25 students, and in very small schools it is 10 students. Thus P = 0.1, Q = 0.1, R = 0.8, MENR = 25 and VENR = 10.
From Step 2, L = 0.95; then (Step 3) the sample size of larger schools must be at least 150 x (0.80/0.95) = 126.3. That is, at least 126 of the larger schools must be sampled. From Step 5, the number of moderately small schools required is (5 250 x 0.1)/(0.95 x 25) = 22.1, i.e., 22 schools. From Step 6, the number of very small schools required is (2 625 x 0.1)/(0.95 x 10) = 27.6, i.e., 28 schools.
This gives a total sample size of 126 + 22 + 28 = 176 schools, rather than just 150, or 193 as calculated above. Before considering school and student non-response, the larger schools will yield a sample of 126 x 35 = 4 410 students. The moderately small schools will give an initial sample of approximately 22 x 25 = 550 students, and very small schools will give an initial sample size of approximately 28 x 10 = 280 students. The total initial sample size of students is therefore 4 410 + 550 + 280 = 5 240.
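Steps 1 to 6 and the worked example for country X can be reproduced in a few lines. This is a sketch under the example's stated assumptions (TCS = 35, a school sample of 150 and a student sample of 5 250); the variable names follow the text.

```python
# Worked example for country X, using the values given in the text.
P, Q, R = 0.1, 0.1, 0.8      # enrolment shares: very small, moderately small, large
MENR, VENR = 25, 10          # mean ENR of moderately small and very small schools

L = 1 - P / 2                                  # Step 2: under-sample very small schools by half
n_large = round(150 * R / L)                   # Step 3: minimum number of larger schools
n_moderate = round(5250 * Q / (L * MENR))      # Step 5
n_very_small = round(2625 * P / (L * VENR))    # Step 6

print(L, n_large, n_moderate, n_very_small)    # 0.95 126 22 28
print(n_large + n_moderate + n_very_small)     # 176
```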
Assigning a Measure of Size to Each School
For the probability proportional to size sampling method used for PISA, a measure of size (MOS) derived from ENR was established for each school on the sampling frame. Where no explicit stratification of very small schools was required, or if small schools (including very small schools) were separately stratified because school size was an explicit stratification variable and they did not account for 5 per cent or more of the target population, NPMs were asked to construct the MOS as: MOS = max(ENR, TCS).

The measure of size was therefore equal to the enrolment estimate, unless it was less than the TCS, in which case it was set equal to the target cluster size. In most countries, TCS = 35, so that the MOS was equal to ENR or 35, whichever was larger.

When an explicit stratum of very small schools (schools with ENR less than or equal to TCS/2) was required, the MOS for each very small school was set equal to TCS/2. As sample schools were selected according to their size (PPS), setting the measure of size of small schools to 35 is equivalent to drawing a simple random sample of small schools.
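The MOS rule amounts to the following sketch. The function name is invented, and TCS = 35 as in most countries.

```python
TCS = 35  # target cluster size in most countries

def measure_of_size(enr, in_very_small_stratum=False):
    """MOS = max(ENR, TCS) for ordinary schools; schools in an explicit
    very-small-school stratum instead get the constant MOS of TCS/2
    described above (illustrative sketch)."""
    if in_very_small_stratum:
        return TCS / 2
    return max(enr, TCS)
```

A constant MOS within a stratum gives every school the same selection probability, which is why the very small school stratum reduces to a simple random sample.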
School Sample Selection
Sorting the sampling frame
The Sampling Manual indicated that, prior to selecting schools from the school sampling frame, schools in each explicit stratum were to be sorted by the variables chosen for implicit stratification and finally by the ENR value within each implicit stratum. The schools were first to be sorted by the first implicit stratification variable, then by the second implicit stratification variable within the levels of the first sorting variable, and so on, until all implicit stratification variables were exhausted. This gave a cross-classification structure of cells, where each cell represented one implicit stratum on the school sampling frame. NPMs were to alternate the sort order between implicit strata, from high to low and then low to high, etc., through all implicit strata within an explicit stratum.
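The alternating ("serpentine") sort might be sketched as follows, assuming each school is represented as a dict carrying its implicit stratification variables and an ENR value; all field names here are illustrative, not from the manual.

```python
def serpentine_sort(schools, implicit_vars, reverse=False):
    """Sort schools by the implicit stratification variables, alternating
    the ENR sort direction from one implicit stratum (cell) to the next."""
    if not implicit_vars:
        # Innermost level: order by ENR, direction alternating per cell.
        return sorted(schools, key=lambda s: s["ENR"], reverse=reverse)
    first, rest = implicit_vars[0], implicit_vars[1:]
    ordered, flip = [], reverse
    for value in sorted({s[first] for s in schools}):
        group = [s for s in schools if s[first] == value]
        ordered.extend(serpentine_sort(group, rest, flip))
        flip = not flip          # reverse direction for the next cell
    return ordered
```

The effect is that neighbouring schools in the frame are always of similar size, even across cell boundaries, which benefits the nearest-neighbour replacement-school assignment described later.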
School sample allocation over explicit strata

The total number of schools to be sampled in each country needed to be allocated among the explicit strata so that the expected proportion of students in the sample from each explicit stratum was approximately the same as the population proportions of eligible students in each corresponding explicit stratum. There were two exceptions. If an explicit stratum of very small schools was required, students in them had
smaller percentages in the sample than those in the population. To compensate for the resulting loss of sample, the large school strata had slightly higher percentages in the sample than the corresponding population percentages. The other exception occurred if only one school was allocated to any explicit stratum. In these cases, NPMs were requested to allocate two schools for selection in the stratum.
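A proportional-to-students allocation with the two-school minimum can be sketched as follows. The stratum names and counts are invented, and real allocations would also be rebalanced so the rounded total stays at the planned school sample size.

```python
def allocate_schools(student_counts, total_schools=150):
    """Allocate the school sample across explicit strata in proportion
    to student counts (not school counts), with at least two schools per
    stratum as described above. Sketch only: the rounded total may drift
    slightly from total_schools and need rebalancing."""
    total_students = sum(student_counts.values())
    return {stratum: max(2, round(total_schools * n / total_students))
            for stratum, n in student_counts.items()}

print(allocate_schools({"region A": 60000, "region B": 39500, "region C": 500}))
```

In this example the tiny stratum "region C" would receive less than one school proportionally, so the minimum of two applies.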
Determining which schools to sample

The Probability Proportional to Size (PPS) systematic sampling method used in PISA first required the computation of a sampling interval for each explicit stratum. This calculation involved the following steps:
• recording the total measure of size, S, for all schools in the sampling frame for each specified explicit stratum;
• recording the number of schools, D, to be sampled from the specified explicit stratum, which was the number allocated to the explicit stratum;
• calculating the sampling interval, I, as follows: I = S/D; and
• recording the sampling interval, I, to four decimal places.

Next, a random number (drawn from a uniform distribution) had to be selected for each explicit stratum. The generated random number (RN) was to be a number between 0 and 1 and was to be recorded to four decimal places.
The next step in the PPS selection method in each explicit stratum was to calculate selection numbers, one for each of the D schools to be selected in the explicit stratum. Selection numbers were obtained using the following method:
• obtaining the first selection number by multiplying the sampling interval, I, by the random number, RN. This first selection number was used to identify the first sampled school in the specified explicit stratum;
• obtaining the second selection number by simply adding the sampling interval, I, to the first selection number. The second selection number was used to identify the second sampled school; and
• continuing to add the sampling interval, I, to the previous selection number to obtain the next selection number. This was done until all specified line numbers (1 through D) had been assigned a selection number.

Thus, the first selection number in an explicit stratum was RN x I, the second selection number was (RN x I) + I, the third selection number was (RN x I) + I + I, and so on. Selection numbers were generated independently for each explicit stratum, with a new random number selected for each explicit stratum.
Identifying the sampled schools

The next task was to compile a cumulative measure of size in each explicit stratum of the school sampling frame that determined which schools were to be sampled. Sampled schools were identified as follows.

Let Z denote the first selection number for a particular explicit stratum. It was necessary to find the first school in the sampling frame where the cumulative MOS equalled or exceeded Z. This was the first sampled school. In other words, if Cs was the cumulative MOS of a particular school S in the sampling frame and C(s-1) was the cumulative MOS of the school immediately preceding it, then the school in question was selected if Cs was greater than or equal to Z, and C(s-1) was strictly less than Z. Applying this rule to all selection numbers for a given explicit stratum generated the original sample of schools for that stratum.
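Putting the sampling interval, the selection numbers and the cumulative-MOS rule together, the whole PPS systematic selection for one explicit stratum can be sketched as below. Function and variable names are invented; the frame is assumed to be a sorted list of (school identifier, MOS) pairs.

```python
import random

def pps_systematic_sample(frame, n_schools, rng=None):
    """Select n_schools from `frame` (a sorted list of (school_id, MOS)
    pairs for one explicit stratum) using the PPS systematic method:
    interval I = S/D, random number RN, selection numbers RN*I + k*I,
    and the cumulative-MOS rule Cs >= Z > C(s-1)."""
    rng = rng or random.Random()
    total_mos = sum(mos for _, mos in frame)          # S
    interval = round(total_mos / n_schools, 4)        # I, four decimal places
    rn = round(rng.random(), 4)                       # RN, four decimal places
    selections = [rn * interval + k * interval for k in range(n_schools)]
    sampled, cumulative, i = [], 0.0, 0
    for school_id, mos in frame:
        cumulative += mos                             # Cs
        # the school whose cumulative MOS first reaches a selection
        # number is sampled (a huge school can absorb several numbers)
        while i < len(selections) and cumulative >= selections[i]:
            sampled.append(school_id)
            i += 1
    return sampled

frame = [(f"school{i:02d}", 35) for i in range(10)]
print(pps_systematic_sample(frame, 2, random.Random(1)))
```

Because the selection numbers are spaced one interval apart, a school's chance of selection is proportional to its MOS, which is the defining property of PPS sampling.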
Identifying replacement schools

Each sampled school in the main survey was assigned two replacement schools from the sampling frame, identified as follows. For each sampled school, the schools immediately preceding and following it in the explicit stratum were designated as its replacement schools. The school immediately following the sampled school was designated as the first replacement and labelled R1, while the school immediately preceding the sampled school was designated as the second replacement and labelled R2. The Sampling Manual noted that in small countries, there could be problems when trying to identify two replacement schools for each sampled school. In such cases, a replacement school was allowed to be the potential replacement for two sampled schools (a first replacement for the preceding school, and a second replacement for the following school), but an actual replacement for only one school. Additionally, it may have been difficult to assign replacement schools for
some very large sampled schools because the sampled schools appeared very close to each other in the sampling frame. There were times when NPMs were only able to assign a single replacement school, or even none, when two consecutive schools in the sampling frame were sampled.
Exceptions were allowed if a sampled school happened to be the last school listed in an explicit stratum; in this case, the two schools immediately preceding it were designated as replacement schools. Similarly, if a sampled school was the first school listed in an explicit stratum, the two schools immediately following it were designated as replacement schools.
Assigning school identifiers

To keep track of sampled and replacement schools in the PISA database, each was assigned a unique three-digit school code and a two-digit stratum code (corresponding to the explicit strata), numbered sequentially starting with one within each explicit stratum. For example, if 150 schools are sampled from a single explicit stratum, they are assigned identifiers from 001 to 150. First replacement schools in the main survey are assigned the school identifier of their corresponding sampled schools, incremented by 200. For example, the first replacement school for sampled school 023 is assigned school identifier 223. Second replacement schools in the main survey are assigned the school identifier of their corresponding sampled schools, incremented by 400. For example, the second replacement school for sampled school 136 is assigned school identifier 536.
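The replacement and identifier rules can be sketched together as follows. All names are invented, and the first/last-school exceptions and shared-replacement cases described above are ignored for brevity.

```python
def assign_school_codes(frame_ids, sampled_ids):
    """Assign three-digit PISA codes within one explicit stratum:
    sampled schools are numbered 001, 002, ...; R1 (the next school in
    the frame) gets the sampled code + 200, and R2 (the preceding
    school) gets it + 400. Sketch only: edge cases are not handled."""
    codes = {}
    for k, school in enumerate(sampled_ids, start=1):
        pos = frame_ids.index(school)
        codes[school] = f"{k:03d}"
        if pos + 1 < len(frame_ids):                 # R1: school after
            codes[frame_ids[pos + 1]] = f"{k + 200:03d}"
        if pos > 0:                                  # R2: school before
            codes[frame_ids[pos - 1]] = f"{k + 400:03d}"
    return codes
```

For a frame a, b, c, d, e, f, g with schools b and f sampled, b and f receive 001 and 002, their followers c and g receive 201 and 202, and their predecessors a and e receive 401 and 402.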
Tracking sampled schools

NPMs were encouraged to make every effort to confirm the participation of as many sampled schools as possible to minimise the potential for non-response biases. They contacted replacement schools only after all contacts with sampled schools were made. Each sampled school that did not participate was replaced if possible. If both an original school and a replacement participated, only the data from the original school were included in the weighted data.
Monitoring the School Sample

All countries had the option of having their school sample selected by the Consortium. Belgium (Fl.), Belgium (Fr.), Latvia, Norway, Portugal and Sweden took this option. They were required to submit sampling forms 1 (time of testing and age definition), 2 (national desired target population), 3 (national defined target population), 4 (sampling frame description), 5 (excluded schools), 7 (stratification), 11 (school sampling frame) and 12 (school tracking form). The Consortium completed and returned the others (forms 6, 8, 9 and 10).
For each country drawing its own sample, the target population definition, the levels of exclusions, the school sampling procedures, and the school participation rates were monitored through 12 sampling forms submitted to the Consortium for review and approval. Table 16 provides a summary of the information required on each form and the timetables (which depended on national assessment periods). (See Appendix 9 for copies of the sampling forms.)
Once received from each country, each form was reviewed and feedback was provided to the country. Forms were only approved after all criteria were met. Approval of deviations was only given after discussion and agreement by the Consortium. In cases where approval could not be granted, countries were asked to make revisions to their sample design and sampling forms.
The checks performed in the monitoring of each form follow. All entries were reviewed in their own right, but the matters listed below were explicitly examined.
Sampling Form 1: Time of Testing and Age Definition
• Assessment dates had to be appropriate for the selected target population dates.
• Assessment dates could not cover more than a 42-day period.
Sampling Form 2: National Desired Target Population
• Large deviations between the total national number of 15-year-olds and the enrolled number of 15-year-olds were questioned.
Table 16: Schedule of School Sampling Activities

Activity | Sampling Form Submitted to Consortium | Due Date
Specify time of testing and age definition of population to be tested | 1 - Time of Testing and Age Definition | At least six months before the beginning of testing and at least two months before planned school sample selection.
Define National Desired Target Population | 2 - National Desired Target Population | Two months prior to the date of planned selection of the school sample.
Define National Defined Target Population | 3 - National Defined Target Population | Two months prior to the date of planned selection of the school sample.
Create and describe sampling frame | 4 - Sampling Frame Description | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Decide on schools to be excluded from sampling frame | 5 - Excluded Schools | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Decide how to treat small schools | 6 - Treatment of Small Schools | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Decide on explicit and implicit stratification variables | 7 - Stratification | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Describe population within strata | 8 - Population Counts by Strata | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Allocate sample over explicit strata | 9 - Sample Allocation by Explicit Strata | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Select the school sample | 10 - School Sample Selection | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Identify sampled schools, replacement schools and assign PISA school identification numbers | 11 - School Sampling Frame | One month before final approval of the school sample is required, or two months before data collection is to begin, whichever is the earlier.
Create a school tracking form | 12 - School Tracking Form | Within one month of the end of the data collection period.
• Any population to be omitted from the international desired population was noted and discussed, especially if the percentage of 15-year-olds to be excluded was more than 2 per cent.
• Calculations were verified.
Sampling Form 3: National Defined Target Population
• The population figure in the first question needed to correspond with the final population figure on Sampling Form 2.
• Reasons for excluding schools were checked for appropriateness.
• The number and percentage of students to be excluded at the school level, and whether the percentage was less than the maximum percentage allowed for such exclusions, were checked.
• Calculations were verified and the overall coverage figures were assessed.
Sampling Form 4: Sampling Frame Description
• Special attention was paid to countries that reported on this form that a three-stage sampling design was to be implemented, and additional information was sought from such countries to ensure that the first-stage sampling was done adequately.
• The type of school-level enrolment estimate and the year of data availability were assessed for reasonableness.
Sampling Form 5: Excluded Schools
• The number of schools and the total enrolment figures, as well as the reasons for exclusion, were checked to ensure correspondence with figures reported on Sampling Form 3 about school-level exclusions.
Sampling Form 6: Treatment of Small Schools
• Calculations were verified, as was the decision about whether or not a moderately small schools stratum and/or a very small schools stratum were needed.
Sampling Form 7: Stratification
• Since explicit strata are formed to group like schools together, to reduce sampling variance and to ensure appropriate representativeness of students in various school types, using variables that might have an effect on outcomes, each country's choice of explicit
stratification variables was assessed. If a country was known to have school tracking, and tracks or school programmes were not among the explicit stratifiers, a suggestion was made to include this type of variable.
• If no implicit stratification variables were noted, suggestions were made about ones that might be used.
• The sampling frame was checked to ensure that the stratification variables were available for all schools. Different explicit strata were allowed to have different implicit stratifiers.
Sampling Form 8: Population Counts by Strata
• Counts on Sampling Form 8 were compared to counts arising from the frame. Any differences were queried and almost always corrected.
Sampling Form 9: Sample Allocation by Explicit Strata
• All explicit strata had to be accounted for on Sampling Form 9.
• All explicit strata population entries were compared to those determined from the sampling frame.
• The calculations for school allocation were checked to ensure that schools were allocated to explicit strata based on explicit stratum student percentages and not explicit stratum school percentages.
• The percentage of students in the sample for each explicit stratum had to be close to the percentage in the population for each stratum (very small schools strata were an exception, since under-sampling was allowed).
• The overall number of schools to be sampled was checked to ensure that at least 150 schools would be sampled.
• The overall number of students to be sampled was checked to ensure that at least 5 250 students would be sampled.
Sampling Form 10: School Sample Selection
• All calculations were verified.
• Particular attention was paid to the four decimal places that were required for both the sampling interval and the random number.
Sampling Form 11: School Sampling Frame
• The frame was checked for proper sorting according to the implicit stratification scheme and enrolment values, and for the proper assignment of the measure of size value, especially for moderately small and very small schools. The accumulation of the measure of size values was also checked for each explicit stratum. The final cumulated measure of size value for each stratum had to correspond to the 'Total Measure of Size' value on Sampling Form 10 for that explicit stratum. Additionally, each line selection number was checked against the frame cumulative measure of size figures to ensure that the correct schools were sampled. Finally, the assignment of replacement schools and PISA identification numbers was checked to ensure that all rules laid out in the Sampling Manual were adhered to. Any deviations were discussed with each country and either corrected or accepted.
Sampling Form 12: School Tracking Form
• Sampling Form 12 was checked to see that the PISA identification numbers on this form matched those on the sampling frame.
• Checks were made to ensure that all sampled and replacement schools were accounted for.
• Checks were also made to ensure that status entries were in the requested format.
Student Samples
Student selection procedures in the main study were the same as those used in the field trial. Student sampling was generally undertaken at the national centres from lists of all eligible students in each school that had agreed to participate. These lists could have been prepared at national, regional, or local levels as data files, computer-generated listings, or by hand, depending on who had the most accurate information. Since it was very important that the student sample be selected from accurate, complete lists, the lists needed to be prepared not too far in advance of the testing and had to list all eligible students. It was suggested that the lists be received one to two months before testing so that the NPM would have time to select the student samples.
Some countries chose student samples that included students aged 15 and/or enrolled in a specific grade (e.g., grade 10). Thus, a larger overall sample, including 15-year-old students and students in the designated grade (who may or may not have been aged 15), was selected. The necessary steps in selecting larger samples are highlighted where appropriate in the following steps. Only Iceland, Japan, Hungary and Switzerland selected grade samples, and none used the standard method described here. For Iceland and Japan, the sample was called a grade sample because over 99.5 per cent of the PISA-eligible 15-year-olds were in the grade sampled. Switzerland supplemented the standard method with additional schools where only a sample of grade-eligible students (15-year-olds plus others) was selected. Hungary selected a grade 10 classroom from each PISA school, independent of the PISA student sample.
Preparing a list of age-eligible students

Appendix 10 shows an example Student Listing Form, as well as school instructions about how to prepare the lists. Each school drawing an additional grade sample was to prepare a list of age- and grade-eligible students that included all students in the designated grade (e.g., grade 10), and all other 15-year-old students (using the appropriate 12-month age span agreed upon for each country) currently enrolled in other grades.
NPMs were to use the Student Listing Form as shown in the Appendix 10 example, but could develop their own instructions. The following were considered important:
• Age-eligible students were all students born in 1984 (or the appropriate 12-month age span agreed upon for the country).
• The list was to include students who might not be tested due to a disability or limited language proficiency.
• Students who could not be tested were to be excluded from the assessment after the student sample was selected.
• It was suggested that schools retain a copy of the list in case the NPM had to call the school with questions.
• A computer list was to be up-to-date at the time of sampling, rather than prepared at the beginning of the school year.

Students were identified by their unique student identification numbers.
Selecting the student sample

Once NPMs received the list of eligible students from a school, the student sample was to be selected and the list of selected students (i.e., the
Student Tracking Form) returned to the school. NPMs were encouraged to use KeyQuest®, the PISA sampling software, to select the student samples. Alternatively, they were to undertake the steps that follow.
Verifying student lists

Student lists were checked to confirm a 1984 year of birth (or a birth date within the agreed-upon time span) or, if grade sampling took place, the grade specified by the NPM; when this was not the case, the student's name was crossed out. The schools were contacted regarding any questions or discrepancies. Students not born in 1984 (or within the agreed-upon time span, or not in the grade if grade sampling was to be done) were to be deleted from the list, after confirming with the school. Duplicates were deleted from the list before sampling.
Numbering students on the list

Using the line number column on the Student Listing Form (or the margin if using another list), eligible students on the list were numbered consecutively for each school. The numbering was checked for repetitions or omissions, and corrected before the student sample was selected.
Computing the sampling interval and random start
After verifying the list, the total number of 15-year-old students (born in 1984 or the agreed-upon time span) was recorded in Box A on theStudent Tracking Form (see Appendix 11). Thiswould be the same number as the last linenumber on the list, if no grade sampling werebeing done. The total number of listed studentswas entered in Box B. The desired sample size(usually 35) was generally the same for eachschool and was entered in Box C on the StudentTracking Form. A random number was chosenfrom the RN table and recorded as a four-digitdecimal (e.g., 3 279 = 0.3279) in Box D.
Whether or not grade sampling was being done, the sampling interval was computed by dividing the total number of students aged 15 (Box A) by the number in Box C, the planned student sample size for each school, and recorded in Box E. If the number was less than 1.0000, a 1 was recorded. If grade sampling was being done, this interval was applied to the larger list (grade and age-eligible students) to generate a larger sample.
A random start point was computed by multiplying the RN decimal by the sampling interval and adding one, and the result was recorded in Box F. The integer portion of this value was the line number of the first student to be selected.
Determining line numbers and selecting students

The sampling interval (Box E) was added to the value in Box F to determine the line numbers of the students to be sampled, until the largest line number was equal to or larger than the total number of students listed (Box B). Each line number selected was to be recorded as a whole integer. Column (2) on the Student Tracking Form was used to record selected line numbers in sequence. NPMs were instructed to write an S in front of the line number on the original student list for each student selected. If a grade sample was being selected, the total student sample was at least as large as (probably somewhat larger than) the number in Box C, and included some age-eligible students (about equal to the number in Box C) and grade-eligible students.
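The interval-and-random-start procedure above can be sketched in code. This is an illustrative sketch, not the PISA software (KeyQuest® was the recommended tool); the function name and parameter names are the author's of this sketch, and only the rules for Boxes A to F come from the text.

```python
def select_student_sample(num_age_eligible, num_listed, sample_size, random_number):
    """Systematic sampling of student line numbers, following Boxes A to F.

    num_age_eligible -- Box A: 15-year-olds (born in 1984 or the agreed span)
    num_listed       -- Box B: total students listed (larger than Box A when
                        grade sampling adds grade-eligible students)
    sample_size      -- Box C: planned sample size per school (usually 35)
    random_number    -- Box D: random number recorded as a four-digit decimal,
                        e.g. 3 279 -> 0.3279
    Returns the selected line numbers on the student list.
    """
    # Box E: sampling interval, recorded as 1 if it falls below 1.0000.
    interval = max(1.0, num_age_eligible / sample_size)

    # Box F: random start; its integer portion is the first selected line.
    position = random_number * interval + 1

    selected = []
    # Keep adding the interval until the line numbers run past the list (Box B).
    while int(position) <= num_listed:
        selected.append(int(position))
        position += interval
    return selected
```

For example, with 7 listed age-eligible students, a planned sample of 3 and a random decimal of 0.3279, the interval is 7/3 ≈ 2.33 and the selected line numbers are 1, 4 and 6.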
Preparing the Student Tracking Form

Once the student sample was selected, the Student Tracking Form for each school could be completed by the NPM. The Student Tracking Form was the central administration document for the study. When a Student Tracking Form was sent to a school, it served as the complete list of the student sample. Once booklets were assigned to students, the Student Tracking Form became the link between the students, the test booklets, and the background questionnaires that they received. After the testing, the results of the session (in terms of presence, absence, exclusion, etc.) were entered on the Student Tracking Form and summarised. The Student Tracking Form was sent back to the national centre with the test instruments and was used to make sure that all materials were accounted for correctly.
The Student Tracking Form was completed as follows:
• The header of the Tracking Form was prepared with the country name, the school name and the school identifier.
• Boxes A to F were checked for completion. These boxes provided the documentation for the student sampling process.
• The names of the selected students were listed in column (3) of the Student Tracking Form. The line number for each selected student, from the Listing Form, had already been entered in column (2). The pre-printed student number in column (1) was the student's new 'ID' number, which became part of a unique number used to identify the student throughout the assessment process.
The Student Tracking Form, which consisted of two sides, was designed to accommodate student samples of up to 35 students, which was the standard PISA design. If the student sample was larger than 35, NPMs were instructed to use more than one Student Tracking Form, and on the additional forms, to renumber the pre-printed identification numbers in column (1) to correspond to the additional students. For example, column (1) of the second form would begin with 36, 37, 38, etc.
• For each sampled student, the student's grade,
date of birth and sex were entered on the Student Tracking Form. It was possible for NPMs to modify the form to include other demographic variables.
• The 'Excluded' column was to be left blank by the NPM. This column was to be used by the school to designate any students who could not be tested due to a disability or limited language proficiency.
• If booklet numbers were to be assigned to students at the national centre, the booklet number assigned to each student was entered in column (8). Participation status on the Student Tracking Form was to be left blank at the time of student sampling. This information would only be entered at the school when the assessment was administered.
Preparing instructions for excluding students

PISA was a timed assessment administered in the instructional language(s) of each country and designed to be as inclusive as possible. For students with limited language proficiency or with physical, mental, or emotional disabilities, PISA developed instructions for resolving cases of doubt about whether a selected student could participate and should be assessed.
NPMs used the guidelines in Figure 5 to develop instructions; School Co-ordinators and Test Administrators7 needed precise instructions for exclusions. The national operational definitions for within-school exclusions were to
7 The roles of School Co-ordinators and Test Administrators are described in detail in Chapter 6.
be well documented and submitted to the Consortium for review before testing.
Sending the Student Tracking Form to the School Co-ordinator and Test Administrator
The School Co-ordinator needed to know which students were sampled in order to notify them and their teachers (and parents), to update information and to identify the students to be excluded. The Student Tracking Form and Guidelines for Excluding Students were therefore sent about two weeks before the assessment session. It was recommended that a copy of the Tracking Form be made and kept at the national centre. Another recommendation was to have the NPM send a copy of the form to the Test Administrator with the assessment booklets and questionnaires in case the school copy was misplaced before the assessment day. The Test Administrator and School Co-ordinator manuals (see Chapter 6) both assumed that each would have a copy.
INSTRUCTIONS FOR EXCLUDING STUDENTS
The following guidelines define general categories for the exclusion of students within schools. These guidelines need to be carefully implemented within the context of each educational system. The numbers to the left are codes to be entered in column (7) of the Student Tracking Form to identify excluded students.
1 = Functionally disabled students. These are students who are permanently physically disabled in such a way that they cannot perform in the PISA testing situation. Functionally disabled students who can respond to the test should be included in the testing.
2 = Educable mentally retarded students. These are students who are considered, in the professional opinion of the school principal or of another qualified staff member, to be educable mentally retarded, or who have been psychologically tested as such. This includes students who are emotionally or mentally unable to follow even the general instructions of the test. However, students should not be excluded solely because of poor academic performance or disciplinary problems.
3 = Students with limited proficiency in the test language. These are students who are unable to read or speak the language of the test and would be unable to overcome the language barrier in the test situation. Typically, a student who has received less than one year of instruction in the language of the test should be excluded, but this definition may need to be adapted in different countries.
4 = Other.
It is important that these criteria be followed strictly for the study to be comparable within and across countries. When in doubt, the student should be included.
Figure 5: Student Exclusion Criteria as Conveyed to National Project Managers
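The four exclusion codes in Figure 5 can be represented as a small lookup table, for instance when validating entries in column (7) during data entry. This is an illustrative sketch; the names and the validation helper are the sketch's own, and only the codes and their meanings come from the figure.

```python
# Within-school exclusion codes from Figure 5, as entered in column (7)
# of the Student Tracking Form.
EXCLUSION_CODES = {
    1: "Functionally disabled student",
    2: "Educable mentally retarded student",
    3: "Limited proficiency in the test language",
    4: "Other",
}

def describe_exclusion(code):
    """Return the label for an exclusion code, rejecting unknown values."""
    if code not in EXCLUSION_CODES:
        raise ValueError(f"Unknown exclusion code: {code!r}")
    return EXCLUSION_CODES[code]
```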
Chapter 5

Translation and Cultural Appropriateness of the Test and Survey Material
Aletta Grisay
Introduction
Translation errors are known to be a major cause of poor item functioning in international tests. They are much more frequent than other problems, such as clearly identified discrepancies due to cultural biases or curricular differences.
If a survey is done merely to rank countries or students, this problem can largely be avoided: once the most unstable items have been identified and dropped, the few remaining problematic items are unlikely to affect the overall estimate of a country's mean in any significant way.
The aim of PISA, however, is to develop descriptive scales, and in this case translation errors are of greater concern. The interpretation of such scales can be severely biased by item characteristics that are unstable from one country to another. PISA has therefore implemented stricter verification procedures for translation equivalence than those used in many prior surveys. This chapter describes these procedures and their implementation and outcomes.
A number of quality assurance procedures were implemented in the PISA 2000 assessment to ensure equivalent national test and questionnaire materials. These included:
• providing two parallel source versions (in English and French), and recommending that each country develop two independent versions in their language of instruction (one from each source language), then reconcile them into one national version;
• systematically adding information to the test and questionnaire materials to be translated about the Question Intent, to clarify its scope and characteristics, and appending frequent Translation Notes for possible translation or adaptation problems;
• developing detailed translation/adaptation guidelines for the test material, and for revising it after the field trial, as an important part of the PISA National Project Manager Manual;
• training key staff from each national team on recommended translation procedures; and
• appointing and training a group of international verifiers (professional translators proficient in English and French, with native command of each target language), to verify the national versions against the source versions.
Double Translation from Two Source Languages
A back translation procedure is the one most frequently used to ensure the linguistic equivalence of test instruments in international surveys. It requires translating the source version of the test (generally English) into the national languages, then translating these national versions back into the source language and comparing them with the original to identify possible discrepancies.
This technique is relatively effective for detecting mistranslations or major interpretation problems (Hambleton et al., in press). For example, in the original English version of one of the PISA reading texts proposed for the field trial, no lexical or grammatical clue enabled the reader to identify the main character's gender.
Many languages imposed a choice in almost every sentence in which the character occurs. The comparison of several back translations almost inevitably brings out this type of problem.
Back translation has a serious deficiency, however, which has often been pointed out. In many cases, the translation is incorrect because it is too literally transposed, but there is a fairly high risk that the back translation would merely recover the original text without revealing the error.
Somerset Maugham's short story The Ant and the Grasshopper was initially proposed as a reading text for the PISA 2000 field trial. Both translators who worked on the French source version translated the content of the underlined sentence in the following passage word for word:
'In this admirable fable (I apologise for telling something which everyone is politely, but inexactly, supposed to know) the ant spends a laborious summer gathering its winter store, while the grasshopper sits on a blade of grass singing to the sun.'
Translation 1:
« Dans cette fable remarquable (que le lecteur me pardonne si je raconte quelque chose que chacun est courtoisement censé savoir, mais pas exactement), la fourmi consacre un été industrieux à rassembler des provisions pour l'hiver, tandis que la cigale le passe sur quelque brin d'herbe, à chanter au soleil. »
Translation 2:
« Dans cette fable admirable (veuillez m'excuser de rappeler quelque chose que chacun est supposé connaître par politesse mais pas précisément), la fourmi passe un été laborieux à constituer des réserves pour l'hiver, tandis que la cigale s'installe sur l'herbe et chante au soleil. »
Both translations are literally correct, and would back-translate into an English sentence quite parallel to the original sentence. However, both are semantically unacceptable, since neither reflects the irony of 'politely, but inexactly, supposed to know'. The versions were reconciled and eventually translated into:
« Dans cette fable remarquable (que l'on me pardonne si je raconte quelque chose que chacun est supposé connaître – supposition qui relève de la courtoisie plus que de l'exactitude), la fourmi consacre un été industrieux à rassembler des provisions pour l'hiver, tandis que la cigale le passe sur quelque brindille, à chanter au soleil. »
Both translations 1 and 2 would probably appear to be correct when back-translated, whereas the reconciled version would appear further from the original.
It is also interesting to note that both French translators and the French reconciler preferred (rightly so, for a French-speaking audience) to revert to the cigale [i.e., cicada] of La Fontaine's original text, while Somerset Maugham adapted the fable to his English-speaking audience by referring to a grasshopper [which would have been sauterelle in French]. However, both translators neglected to draw the entomological consequences from their return to the original: they were too faithful to the English text and allowed a strictly arboreal insect [the cicada] to live on a brin d'herbe [i.e., a blade of grass], which the reconciler replaced by a brindille [i.e., a twig]. In such a case, the back translation procedure would consider cigale and brindille as deviations or errors.
A double translation procedure (i.e., two independent translations from the source language, and reconciliation by a third person) offers significant advantages in comparison with the back translation procedure:
• Equivalence of the source and target languages is obtained by using three different people (two translators and a reconciler) who all work on the source and the target versions. In the back translation procedure, by contrast, the first translator is the only one to focus simultaneously on the source and target versions.
• Discrepancies are recorded directly in the target language instead of in the source language, as would be the case in a back translation procedure.
These examples are deliberately borderline cases, where both translators happened to make a common error (that could theoretically have been overlooked, had the reconciliation been less accurate). But the probability of
detecting errors is obviously considerably higher when three people rather than one compare the source language with the target language.
A double translation procedure was used in the Third International Mathematics and Science Study-Repeat (TIMSS-R) instead of the back translation procedure used in earlier studies by the International Association for the Evaluation of Educational Achievement (IEA).
PISA used double translation from two different languages because both back translation and double translation procedures fall short in that the equivalence of the various national versions depends exclusively on their consistency with a single source version (in general, English). This implicitly gives more weight than would be desirable to cultural forms related to the reference language.
Furthermore, one would wish for as purely semantic an equivalence as possible (since the principle is to measure the access that students from different countries would have to the same meaning, through written material presented in different languages). However, using a single reference language is likely to give more importance than would be desirable to the formal characteristics of that language. If a single source language is used, its lexical and syntactic features, stylistic conventions and typical patterns of organising ideas within the sentence will have more impact than desirable on the target language versions.
Other expected advantages from using two source languages included:
• The verification of the equivalence between the source and the target versions was performed by four different people who all worked directly on the texts in the relevant national languages (i.e., two translators, one national reconciler and the Consortium's verifier).
• Deciding what degree of freedom to take with respect to a source text often causes translation problems. A translation that is too faithful may appear awkward; if it is too free or too literary, it is very likely to fail to be equivalent. Having two source versions in different languages (for which the translation fidelity/freedom has been carefully calibrated and approved by Consortium experts) provides benchmarks for a national reconciler that are far more accurate in this respect, benchmarks that neither
back translation nor double translation from a single language could provide.
• Many translation problems are due to idiosyncrasies: words, idioms, or syntactic structures in one language appear untranslatable into a target language. In many cases, the opportunity to consult the other source version will provide hints at solutions.
• Similarly, resorting to two different languages will, to a certain extent, tone down problems linked to the impact of the cultural characteristics of a single source language. Admittedly, both languages used here share an Indo-European origin, which may be regrettable in this particular case. However, they do represent sets of relatively different cultural traditions, and are both spoken in several countries with different geographic locations, traditions, social structures and cultures.
Nevertheless, as all major international surveys prior to PISA always used English as their source language, empirical evidence on the consequences of using an alternative reference language was lacking. As far as we know, the only interesting findings in this respect were reported in the IEA/Reading Comprehension survey (Thorndike, 1973), which showed better item coherence (factorial structure of the tests, distribution of the discrimination coefficients) between English-speaking countries than across the other participating countries. In this perspective, the double source procedure used in PISA was quite innovative.
Development of Source Versions
Participating countries contributed most of the text passages used as stimuli in the PISA field trial and some of the items: the material retained in the field trial contained submissions from 18 countries.1 An alternative source version was
1 Other than material sourced from IALS, 45 units were included in the field trial, each with its own set of stimulus materials. Of these, 27 sets of stimuli were from five different English-speaking countries while 18 came from countries with 12 other languages (French 4, Finnish 3, German 2, Danish 1, Greek 1, Italian 1, Japanese 1, Korean 1, Norwegian 1, Russian 1, Spanish 1 and Swedish 1) and were translated into English to prepare the English source version, which was then used to prepare the French version.
developed mainly through double translating an essentially English first source version of the material and reconciling it into French.
A group of French domain experts, chaired by Mrs Martine Rémond (one of the PISA Reading Functional Expert Group members), was appointed to review the French source version for equivalence with the English version, for linguistic correctness and for the appropriateness of the terminology used (particularly for mathematics and science material).
Time constraints led to starting the process before the English version was finalised. This put some pressure on the French translators, who had to carefully follow all changes made to the English originals. It did, however, make it possible for them to detect a few residual errors overlooked by the test developers, and to anticipate potential translation problems. In particular, a number of ambiguities or pitfall expressions could be spotted and avoided from the beginning by slightly modifying the source versions, the list of aspects requiring national adaptations could be refined, and further translation notes could be added when a need was identified. In this respect, the development of the French source version served as a translation trial, and probably helped provide National Project Managers (NPMs) with source material that was somewhat easier to translate, or contained fewer potential translation traps, than if a single source had been developed.
Two additional features were embedded in both source versions to help translators. Translation notes were added wherever the test developers thought it necessary to draw the translator's attention to some important aspect, e.g.:
• Imitate as closely as possible some stylistic characteristic of the source version;
• Indicate particular cases where the wording of a question must be the same (or NOT the same) as in a specific sentence in the stimulus;
• Alert to potential difficulties or translation traps; and
• Point out aspects for which a translator is expressly asked to use national adaptations.
In addition, a short description of the Question Intent was provided for each item, to help translators and test markers to better understand the cognitive process that the test developers wished to assess. This proved to be particularly useful for the reading material, where the Question Intent descriptions often contained important information, such as whether a question:
• required a literal match between the question stem and the corresponding passage in the text, or whether a synonym or paraphrase should be provided in the question stem;
• required the student to make some inference or to find some implicit link between the information given in various places in the text. In such cases, for example, the PISA Translation Guidelines (see next section) instructed the translators that they should never add connectors likely to facilitate the student's task, such as however, because, on the other hand, by comparison, etc., where the author did not put any (or, conversely, never forget to use a connector if one had been used by the author); or
• was intended to assess the student's perception of textual form or style. In such cases, it was important that the translation convey accurately such subtleties of the source text as irony, word colour and nuances in character motives.
PISA Translation Guidelines
NPMs also received a document with detailed information on:
• PISA requirements in terms of necessary national version(s). PISA takes as a general principle that students should be tested in the language of instruction used in their school. Therefore, the NPMs of multilingual countries were requested to develop as many versions of the test instruments as there were languages of instruction used in the schools included in their national sample. Cases of minority languages used in only a very limited number of schools could be discussed with the sampling referee to decide whether such schools could be excluded from the target population without affecting the overall quality of the data collection.
• Which parts of the materials had to be double translated, or could be single translated. Double translation was required for the
cognitive tests and questionnaires, and for the optional Computer Familiarity or Information Technology (IT) and Cross-Curriculum Competency (CCC) instruments, but not for the manuals and other logistic material.
• Instructions related to the recruitment of translators and reconcilers, their training, and the scientific and technical support to be provided to the translation team. It was suggested, in particular, that translated material and national adaptations deemed necessary for inclusion be submitted for review and approval to a national expert panel composed of domain specialists.
• Description of the PISA translation procedures. It was required that national version(s) be developed by double translation and reconciliation with the source material. It was recommended that one independent translator use the English source version and that the second use the French version. In countries where the NPM had difficulty appointing competent French translators, double translation from English only was considered acceptable according to the PISA standards.
Other sections of the PISA Translation Guidelines were more directly intended for use by the national translators and reconciler(s):
• Recommendations to prevent common translation traps: a rather extensive section giving detailed examples of problems frequently encountered when translating survey materials, and advice on how to avoid them;
• Instructions on how to use translation notes and descriptions of Question Intents included in the material;
• Instructions on how to adapt the material to the national context, listing a variety of rules identifying acceptable/unacceptable national adaptations;
• Special notes on translating mathematics and science material;
• Special notes on translating questionnaires and manuals; and
• A National Adaptation Form, to document national adaptations included in the material.
An additional section of the Guidelines was circulated to NPMs after the field trial, together with the revised materials to be used in the main study, to help them and their translation teams revise their national version(s).
Translation Training Session
NPMs received sample material to use for recruiting national translators and training them at the national level. The NPM meeting held in November 1998, prior to the field trial translation activities, included a training session for members of the national translation teams (or the person responsible for translation activities) from the participating countries. A detailed presentation was made of the material, of the recommended translation procedures, of the Translation Guidelines, and of the verification process.
International Verification of the National Versions
One of the most productive quality control procedures implemented in PISA to ensure high quality standards in the translated assessment materials consisted in having a team of independent translators, appointed and trained by the Consortium, verify each national version against the English and French source versions.
Two Verification Co-ordination centres were established. One was at the Australian Council for Educational Research (ACER) in Melbourne (for national adaptations used in the English-speaking countries and national versions developed by participating Asian countries). The second was at cApStAn, a translation firm in Brussels (for all other national versions, including the national adaptations used in the French-speaking countries). cApStAn had been involved in preparing the French source version of the PISA material, and was retained both because of its familiarity with the study material and because of its large network of international translators.
The Consortium undertook international verification of all national versions in languages used in schools attended by more than 5 per cent of the country's target population. For languages used in schools attended by minorities comprising 5 per cent or less of the target population, international-level verification was deemed unnecessary, since the impact on the country's results would be negligible, and verification of very low frequency languages was more feasible at the national level. In Spain, for
instance, the Consortium took up the verification of the Castilian and Catalan test material (respectively 72 and 16 per cent of the target population), while the Spanish National Centre was responsible for verifying the Galician, Valencian and Basque test material at the national level (respectively 3, 4 and 5 per cent of the target population).
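The 5 per cent rule, as applied in the Spanish example, amounts to a one-line decision function. This is an illustrative sketch; the function name is the sketch's own, and only the cut-off comes from the text.

```python
def verification_level(share_of_target_population):
    """Decide who verifies a national language version.

    Versions in languages used in schools attended by more than 5 per cent
    of the target population were verified internationally by the
    Consortium; those at or below 5 per cent were verified nationally.
    """
    return "international" if share_of_target_population > 5 else "national"

# Spain: Castilian (72%) and Catalan (16%) fall to international verification;
# Galician (3%), Valencian (4%) and Basque (5%) to national verification.
```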
For a few minority languages, national versions were only developed (and verified) in the main study phase. This was considered acceptable when a national centre had arranged with another PISA country to borrow its field trial national version for their minority (adapting the Swedish version for the Swedish schools in Finland, and the Russian version for Russian schools in Latvia), or when the minority language was considered to be a dialect that differed only slightly from the main national language (Nynorsk in Norway).
A few English or French-speaking countries or communities (Canada (Fr.), Ireland, Switzerland (Fr.) and the United Kingdom) were constrained by early testing windows, and submitted only National Adaptation Forms for verification. This was also considered acceptable, since these countries used national versions that were identical to the source version except for the national adaptations.
The main criteria used to recruit translators to lead the verification of the various national versions were that they were (or had):
• native speakers of the target language;
• experienced translators from either English or French into their target language;
• sufficient command of the second source language (either English or French) to be able to use it for cross-checks in the verification of the material; and
• experience as teachers and/or higher education degrees in psychology, sociology or education, as far as possible.
All appointed verifiers met the first two
requirements; fewer than 10 per cent of them did not have sufficient command of the French language to fully meet the third requirement; and two-thirds of the verifiers met the last requirement. The remaining one-third had higher degrees in other fields, such as philosophy or political and social sciences.
As a general rule, the same verifiers were used for homo-lingual versions (i.e., the various national versions from English, French, German, Italian and Dutch-speaking countries or communities). However, the Portuguese language differs significantly from Brazil to Portugal, and the Spanish language is not the same in Spain and in Mexico, so independent native translators had to be appointed for those four countries.
In a few cases, both in the field trial and the main study verification exercises, the time constraints were too tight for a single person to meet the deadlines, and additional translators had to be appointed and trained.
Verifier training sessions were held in Melbourne and in Brussels, prior to the verification of both the field trial and the main study material. Attendees received copies of the PISA information brochure, the Translation Guidelines, the English and French source versions of the material and a Verification Check List developed by the Consortium. When made available by the countries, a first bundle of target material was also delivered. The training session focused on:
• presenting verifiers with PISA objectives and structure;
• familiarising them with the material to be verified;
• reviewing and extensively discussing the Translation Guidelines and the Verification Check List;
• checking that all verifiers who would work on electronic files knew how to deal with some important Microsoft Word® commands (track changes, comments, edit captions in graphics, etc.) and (if hard copies only had been provided) that they knew how to use standard international proof-reading abbreviations and conventions;
• arranging for schedules and for dispatch logistics; and
• insisting on security requirements.
Verification for the field trial tests required on
average 80 to 100 hours per version. Somewhat less time was required for the main study versions. Most countries submitted their material in several successive shipments, sometimes with a large delay that was inconsistent with the initially negotiated and agreed Verification Schedule. The role of the verification co-ordinators proved to be particularly important in ensuring continued and friendly contact with the national teams, maintaining flexibility in the
availability of the verifiers, keeping records of the material being circulated, submitting doubtful cases of too-free national adaptations to the test developers, forwarding all last-minute changes and edits introduced in the source versions by the test developers to the verifiers, and archiving the reviewed material and verification reports returned by the verifiers.
The PISA 2000 Field Trial led to the introduction of a few changes in the main study verification procedure:
• Manuals to be used by the School Co-ordinator and the Test Administrator should be included in the material to be internationally verified, in order to help prevent possible deviations from the standard data collection procedures.
• Use of electronic files rather than hard copies for verifying the item pool and marking guides was recommended, to save time and to allow the national teams to revise the verified material more easily and more effectively.
• NPMs were asked to submit hard copies of their test booklets before sending them to the printer, so that the verifiers could check the final layout and graphic format, and the instructions given to the students. This also allowed them to check the extent to which their suggestions had been implemented.
Translation and Verification Procedure Outcomes
In an international study where reading was the major domain, the equivalence of the national versions, and the suitability of the procedures employed to obtain it, were, predictably, the focus of many discussions both among the Board of Participating Countries (BPC) members and among national research teams. The following questions in particular were raised.
• Is equivalence across languages attainable at all? Many linguistic features differ so fundamentally from language to language that it may be unrealistic to expect equal difficulty of the reading material in all versions.
• In particular, the differences in the length of translated material from one language to another are well known. In a time-limited assessment, won't students from certain countries be at some disadvantage if all stimuli and items are longer in their language than in others, requiring them to use more time to read, and therefore decreasing the time available for answering questions?
• Is double translation really preferable to other well-established methods such as back-translation?
• Is there a risk that using two source languages will result in more rather than fewer translation errors?
• What is the real value of such a costly and time-consuming exercise as international verification? Wouldn't the time and resources be better spent on additional national verifications?
• Will the verifiers appointed by the Consortium be competent enough, when compared to national translators chosen by an NPM and working under his or her supervision?

Concerns about the value of the international verification were usually allayed when the NPMs started receiving their verified material. Although the quality of the national translations was, with only very few exceptions, rather high (and, in some cases, outstanding), verifiers still found many flaws in virtually all national versions. The most frequent of these included:
• Late corrections introduced in the source versions of the material had been overlooked.
• New mistakes had been introduced when entering revisions (mainly typographical errors or subject-verb agreement errors).
• Graphics errors (keys, captions, text embedded in graphics). A number of these proved very difficult to correct, partly because English words tend to be rather short, so the text frames often lacked space for longer formulations. The results were sometimes barely legible.
• Layout and presentation problems (frequent alignment problems and font or font size discrepancies).
• Line-number errors (frequent changes in the layout resulted in incorrect references to line numbers).
• Incorrect quotations from the text, mostly due to edits implemented in the stimulus without making them in question stems or scoring instructions.
• Missing words or sentences.
• Cases where a question contained a literal match in the source versions but a synonymous match in the translated version, often due to the well-known translator's syndrome of avoiding double and triple repetitions in a text.
• Inconsistencies in punctuation, instructions to students, ways of formulating the question, spelling of certain words, and so on.

The verifiers' reports also provided some interesting indications of three types of translation errors that seemed to occur less frequently in national versions developed through double translation from the two source versions than in versions developed through single translation, or through double translation from a single source version: mistranslations; literal translations of idioms or colloquial expressions (loan translations); and incorrect translation of questions with stems such as 'which of the following…' into national expressions meaning 'which one of the following…'. A cross-check with the French source version ('Laquelle ou lesquelles des propositions suivantes…') would have alerted the translator to the fact that the student was expected to provide one or more answers.
Based on the data collected in the field trial, a few empirical analyses could be made to explore some of the issues raised by the BPC members and the NPMs. These analyses focused on two questions. To what extent could the English and French source versions be considered 'equivalent'? Did the recommended procedure actually result in better national versions than the other procedures used by some of the participating countries?
To explore these two issues, the following analyses were performed:
• Since reading was the major domain in the PISA 2000 assessment, it was particularly crucial that the stimuli used in the test units be as equivalent as possible in terms of linguistic difficulty. The length, and a few indicators of the lexical and syntactic difficulty, of the English and French versions of a sample of PISA stimuli were compared, using readability formulae to assess their relative difficulty in both languages.
• The field trial data from the English- and French-speaking countries or communities were used to check whether the psychometric characteristics of the items in the versions adapted from the English source were similar
to those in the versions adapted from the French source. In particular, it was important to know whether any items showed flaws in all or most of the French-speaking countries or communities, but none in the English-speaking countries or communities, or vice versa.
• Based on the field trial statistics, the national versions developed through the recommended procedure were compared with those obtained through alternative procedures, to identify the translation methods that produced the fewest flawed items.
Linguistic Characteristics of the English and French Source Versions of the Stimuli
Differences in length
The length of stimuli was compared using the texts included in 50 reading and seven science or Integrated units.2 No mathematics units were included, since they had no text stimuli or only very short ones. Some of the reading and science units were also excluded, because their texts were very short or consisted only of tables.
The French versions of the stimuli proved to be significantly longer than the English versions. There were, on average, 12 per cent more words in French (410 words in the French stimuli, on average, compared with 367 in English). In addition, the average word length was greater in French (5.09 characters per word, versus 4.83 in English), so the total character count was considerably higher in French (on average, 18.84 per cent more characters in the French than in the English version of the sampled stimuli). This did not affect the relative length of the passages in the two languages, however: the correlations between the French and English word counts, and between the French and English character counts, were both 0.99.
Some variation was observed from text to text in the increase in length from English to French. There was some evidence that texts that were originally in French or in languages other than English tended to have fewer words (Pole Sud: 2 per cent fewer; Shining Object: 4 per cent fewer; Macondo: 5 per cent fewer) or to have
2 Integrated units were explored in the field trial, but they were not used in the main study.
only minor differences (Police: 2.5 per cent more; Amanda and the Duchess: 1.2 per cent more; Rhinoceros: 5 per cent more; Just Judge: 1.6 per cent more; Corn: 4 per cent more).
Differences in text length and item difficulty

Forty-nine prose texts of sufficient length (more than 150 words in the English version) were retained for an analysis of the possible effects on item difficulty of the higher word count associated with translation into French. For ten of them, the French version was shorter than the English version or very similar in length (less than a 5 per cent increase in the word count). Ten others had a 6 to 10 per cent increase in the French word count, fifteen had an increase of 11 to 20 per cent, and the remaining 14 had an increase of more than 20 per cent.
Eleven PISA countries had English or French as (one of) their instruction language(s). The English source version of the field trial instruments was used, with a few national adaptations, in Australia, Canada (Eng.), Ireland, New Zealand, the United Kingdom and the United States. The French source version was used, with a few national adaptations, in Belgium (Fr.), Canada (Fr.), France, Luxembourg and Switzerland (Fr.).
The booklets used in Luxembourg, however, contained both French and German material, which made it impossible to use the statistics from Luxembourg for a comparison between the French and English data. The analysis was therefore done using the item statistics from the remaining 10 countries: six countries in the Adaptation from English group, and four in the Adaptation from French group.
In the field trial, the students answered a total of 277 questions related to the selected stimuli (five or six questions per text). The overall percentage of correct answers to the subset of questions related to each text was computed for each country in each language group. Table 17 shows the average results by type of stimuli.
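The aggregation behind Table 17 can be sketched in a few lines. The record layout (one tuple per country-by-text observation) and the exact band boundaries are assumptions for the sketch; the report only gives the resulting cell means and standard deviations.

```python
from statistics import mean, stdev

def length_band(pct_longer):
    """Length-increase bands used in Table 17 (per cent more words in French)."""
    if pct_longer <= 5:
        return "<=5%"
    if pct_longer <= 10:
        return "6-10%"
    if pct_longer <= 20:
        return "11-20%"
    return ">20%"

def percent_correct_by_band(records):
    """records: iterable of (pct_longer, language_group, pct_correct) tuples.

    Returns the mean and standard deviation of per cent correct for each
    (length band, language group) cell; each cell needs >= 2 observations
    for the standard deviation to be defined.
    """
    cells = {}
    for pct_longer, group, pct_correct in records:
        cells.setdefault((length_band(pct_longer), group), []).append(pct_correct)
    return {cell: (mean(v), stdev(v)) for cell, v in cells.items()}
```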
In both the English- and French-speaking countries, items proved harder in the group of test units where the French stimuli were only slightly longer than the English version, and easier where the French stimuli were significantly longer.
The mean per cent of correct answers was slightly higher in the English-speaking countries or communities for all groups of test units, but more so for the groups of units containing the stimuli with the largest increase in length in the French version. This group-by-language interaction was significant (F=3.62, p<0.05), indicating that the longer French units tended to be (proportionally) more difficult for French-speaking students than the units with only a slightly greater word count.
Table 17: Percentage of Correct Answers in English- and French-Speaking Countries or Communities for Groups of Test Units with Small or Large Differences in Length of Stimuli in the Source Languages

                                         English-Speaking          French-Speaking
                                         Countries/Communities     Countries/Communities
Word Count in                            Mean %      Standard      Mean %      Standard
French versus English                    Correct     Deviation     Correct     Deviation
French less than 5 per cent
longer than English (10 units)           59.7        11.3          58.1        13.4
French between 6 and 10 per cent
longer than English (10 units)           63.1        13.4          59.1        15.0
French between 11 and 20 per cent
longer than English (15 units)           65.0        13.7          62.1        14.1
French more than 20 per cent
longer than English (14 units)           68.2        11.8          63.7        12.6

Note: N=490 (10 countries and 49 texts).
This pattern suggests that the additional burden on the reading tasks in countries using the longer version was not substantial, but the hypothesis of some effect on the students' performance cannot be discarded.
Linguistic difficulty

Readability indices were computed for the French passages, using the Flesch-De Landsheere (De Landsheere, 1973; Flesch, 1949) and Henry (1975) formulae. For 26 of the English texts, readability indices were computed using the Fry (1977), Dale and Chall (1948) and Spache (1953) formulae.
Most of these formulae use similar indicators to quantify the linguistic density of the texts. The most common indicators, included in both the French and English formulae, are: (i) average length of words (lexical difficulty); (ii) percentage of low-frequency words (abstractness); and (iii) average length of sentences (syntactical complexity).
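These three shared indicators can be computed with a simple sketch. The tokenisation and the `common_words` frequency list are simplifications introduced here for illustration; the actual Flesch, De Landsheere, Henry, Fry, Dale-Chall and Spache formulae weight such indicators with language-specific coefficients.

```python
import re

def difficulty_indicators(text, common_words):
    """Return (average word length, share of low-frequency words,
    average sentence length) for a passage.

    `common_words` is a set of high-frequency words for the language;
    any word not in it counts as low-frequency.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÀ-ÿ'-]+", text)
    avg_word_len = sum(len(w) for w in words) / len(words)
    low_freq = sum(1 for w in words if w.lower() not in common_words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    return avg_word_len, low_freq, avg_sent_len
```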
Metrics vary across languages, and a number of other factors (sometimes language-specific) are used in each of the formulae, which prevents a true direct comparison of the indices obtained for the English and French versions of the same text. The means in Table 18 below must therefore be considered with caution, whereas the correlations between the English and French indices are more reliable.
The correlations were all reasonably high or very high, indicating that the stimuli with higher indices of linguistic difficulty in English tended also to have higher difficulty indices than other passages in French. That is, English texts with more abstract or more technical vocabulary, or with longer and more complex sentences, tended to show the same characteristics when translated into French.
Psychometric Quality of the French and English Versions
Using the statistics from the field trial item analyses, all items presenting one or more of the following flaws were identified in each national version of the instruments:
• differential item functioning (DIF), i.e., the item proved to be significantly easier (or harder) than in most other versions;
• a fit index that was too high (> 1.20); or
• a discrimination index that was too low (< 0.15).
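The three criteria above amount to a simple flagging rule, sketched below. The field names and data layout are assumptions for illustration; the report does not specify how the item statistics were stored.

```python
def flag_flawed_items(item_stats, fit_max=1.20, discr_min=0.15):
    """Flag items meeting any of the three flaw criteria.

    `item_stats` maps item id -> dict with keys:
      'dif'   True if the item showed significant differential item
              functioning relative to most other national versions,
      'fit'   the item fit index,
      'discr' the item discrimination index.
    Returns {item id: list of reasons} for flagged items only.
    """
    flawed = {}
    for item, s in item_stats.items():
        reasons = []
        if s["dif"]:
            reasons.append("DIF")
        if s["fit"] > fit_max:
            reasons.append("fit > 1.20")
        if s["discr"] < discr_min:
            reasons.append("discrimination < 0.15")
        if reasons:
            flawed[item] = reasons
    return flawed
```

The per-country percentages in Tables 19 and 20 are, in effect, the share of administered items that such a rule would flag.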
Some 30 items (of the 561 reading, mathematics, and science items included in the field trial material) appeared to have problems in all or most of the 32 participating countries. Since, in these cases, one can be rather confident that the flaw had to do with the item content rather than with the quality of the translation, all related observations were discarded from the comparisons. A few items with no statistics in some of the countries (items with 0 per cent or 100 per cent correct answers) also had to be discarded. Table 19 shows the distribution of flawed items in the English-speaking and French-speaking countries or communities.
Table 18: Correlation of Linguistic Difficulty Indicators in 26 English and French Texts

                                     English                   French                    Correlation
                                     Mean        Standard      Mean        Standard
                                                 Deviation                 Deviation
Average Word Length                  4.83 char.  0.36          5.09 char.  0.30          0.60
Percentage of Low-Frequency Words    18.7%       8.6           21.5%       5.4           0.61
Average Sentence Length              18 words    6.7           21 words    7.1           0.92
Average Readability Index            Dale-Chall:               Henry:
                                     32.84       9.76          0.496       0.054         0.72
                                     Flesch-DL:
                                     38.62       4.18                                    0.83

Table 19 clearly shows a very similar pattern in the two groups of countries. The percentage of flawed items in the field trial material varied from 5.8 to 9.4 per cent in the English-speaking countries or communities and from 5.3 to 8.9 per cent in the French-speaking countries or communities, and had almost identical means (English: 7.5 per cent, French: 7.7 per cent; F=0.05, p>0.83).
The details by item show that only one item (R228Q03) was flawed in all four French-speaking countries or communities, but in none of the English-speaking countries or communities. Three other items (R069Q03A, R069Q05 and R247Q01) were flawed in three of the four French-speaking countries or communities, but in none of the English-speaking countries or communities. Conversely, only four items (R070Q06, R085Q06, R088Q03, R119Q10) had flaws in all six English-speaking countries or communities, or in four or five of them, but in only one French-speaking country or community. None of the science or mathematics items showed the same kind of imbalance.
Table 19: Percentage of Flawed Items in the English and French National Versions

                            Number     Too     Too     Large   Low          Total Percentage
                            of Items   Easy    Hard    Fit     Discrim-     of Potentially
                                       (%)     (%)     (%)     ination (%)  Flawed Items
Australia                   532        0.4     0.0     5.6     3.0          7.7
Canada (Eng.)               532        0.0     0.6     3.4     3.6          6.6
Ireland                     527        0.0     0.4     5.9     3.6          7.8
New Zealand                 532        0.2     0.0     6.8     1.9          7.5
United Kingdom              530        0.0     0.0     8.9     1.5          9.4
United States               532        0.4     0.6     4.1     1.7          5.8
English-Speaking
Countries or Communities    3 185      0.1     0.3     5.8     2.5          7.5
Belgium (Fr.)               531        0.0     0.6     5.6     3.8          7.9
Canada (Fr.)                531        0.0     0.2     5.6     5.6          8.9
France                      532        0.2     0.9     2.8     2.4          5.3
Switzerland (Fr.)           530        0.0     0.2     3.6     6.2          8.7
French-Speaking
Countries or Communities    2 124      0.4     0.4     4.4     4.5          7.7
Psychometric Quality of the National Versions

In the countries with instruction languages other than English and French, the national versions of the PISA field trial instruments were developed either through the recommended procedures, or through the following alternative methods.3
• Double translation from English, with cross-checks against French: done in Denmark, Finland, Poland and in the German-speaking countries or communities (Austria, Germany, Switzerland (Ger.)).
• Double translation from English, without cross-checks against French: done in the Spanish- and Portuguese-speaking countries or communities (Mexico and Spain, Brazil and Portugal).
• Single translation from English or from French: done in Greece, Korea, Latvia and the Russian Federation. Most of the material was single-translated from English in these countries. In Greece and in Latvia, a small part of the material was single-translated from French and the remainder from English.

3 Not all of these were approved.
• Mixed methods: e.g., Luxembourg had bilingual booklets, with the French material adapted from French and the German material adapted from the 'common' German version. Japan had the reading stimuli double translated from English and French, and the reading items and the mathematics and science material double translated from English. Italy and Switzerland (Ital.) each single translated the material, one from English and the other from French, intending to reconcile the two versions; however, they ran out of time and were able to reconcile only part of the reading material, so both kept the remaining units single translated, with some checks against the version derived from the other source language.

In each country, potentially flawed items were
identified using the same procedure as for the English- and French-speaking countries or communities (explained above). Table 20 shows the percentage of flawed items observed by method and by country.
Table 20: Percentage of Flawed Items by Translation Method

Method                              Mean Percentage   Country               Total Number   Percentage of
                                    of Flawed Items                         of Items       Potentially
                                    per Method                                             Flawed Items
Adaptation from Source Version      7.6               Adapted from English  3 185          7.5
                                                      Adapted from French   2 124          7.7
Double Translation from             8.0               Belgium (Fl.)         532            9.0
English and French                                    Hungary               532            7.5
                                                      Iceland               532            7.3
                                                      Netherlands           532            10.3
                                                      Norway                532            7.0
                                                      Sweden                532            7.0
Double Translation from English     8.8               Austria               532            5.6
(with cross-checks against French)                    Germany               532            7.9
                                                      Switzerland (Ger.)    532            10.5
                                                      Denmark               532            7.5
                                                      Finland               532            8.5
                                                      Poland                532            12.8
Double Translation from English     12.1              Brazil                532            13.9
(without use of French)                               Portugal              532            10.3
                                                      Mexico                532            16.0
                                                      Spain                 532            8.3
Single Translation                  11.1              Greece                532            9.4
                                                      Korea                 532            16.0
                                                      Latvia                532            9.2
                                                      Russian Fed.          532            9.8
Other (Mixed Methods)               10.3              Czech Republic        532            9.4
                                                      Italy                 532            9.2
                                                      Switzerland (Ital.)   532            13.9
                                                      Japan                 532            13.2
                                                      Luxembourg            532            5.6

The results show between-country variation in the number of flawed items within each group of countries using the same translation method, indicating that the method used was not the only determinant of the psychometric quality achieved in developing the instruments. Other important factors were probably the accuracy of the national translators and reconcilers, and the quality of the work done by the international verifiers.
Significance tests of the differences between methods are shown in Table 21. These results seem to confirm the hypothesis that the recommended procedure (Method B: double translation from English and French) produced national versions that did not differ significantly, in terms of the number of flaws, from the versions derived through adaptation from one of the source languages (Method A). Double translation from a single language also appeared to be effective, but only when accompanied by extensive cross-checks against the other source (Method C).

The average number of flaws was higher in all other groups of countries than in those that used both sources (either double translating from the two languages or using one source for double translation and the other for cross-checks). Methods E (single translation) and F (double translation from one language, with no cross-checks) proved to be the least trustworthy methods.
Discussion
The analyses discussed in this chapter show that the relative linguistic complexity of the reading stimuli in the English and French field trial versions of the PISA assessment materials (as measured using readability formulae) was reasonably comparable. However, the absolute differences in word and character counts between the two versions were significant, and had a (modest) effect on the difficulty of the items associated with those stimuli that were much longer in the French version than in the English version.
The average length of words and sentences differs across languages and obviously cannot be entirely controlled by translators when they adapt test instruments. In this respect, it is probably impossible to develop 'true' equivalent versions of tests involving large amounts of written material. However, longer languages often have slightly more redundant morphological or syntactical characteristics, which may help compensate for part of the additional burden on reading tasks, especially in test situations like PISA, with no strong requirements for speed.

Table 21: Differences Between Translation Methods

                        B. Double       C. Double      D. Other      E. Single       F. Double
                        Translation     Translation    Methods       Translation     Translation
                        English and     and Checks                                   No Checks
                        French
A. Adapted              A>B             A>C            A>D           A>E             A>F
from Sources            F=0.45          F=1.73         F=5.23        F=4.24          F=13.79
                        p=0.51          p=0.21         p=0.04        p=0.06          p=0.02
B. Double Translation                   B>C            B>D           B>E             B>F
English and French                      F=0.45         F=2.27        F=4.37          F=7.12
                                        p=0.52         p=0.17        p=0.07          p=0.03
C. Double Translation                                  C>D           C>E             C>F
and Checks                                             F=0.69        F=1.58          F=3.14
                                                       p=0.43        p=0.24          p=0.11
D. Other Methods                                                     D>E             D>F
                                                                     F=0.14          F=0.66
                                                                     p=0.72          p=0.44
E. Single Translation                                                                E>F
                                                                                     F=0.19
                                                                                     p=0.68
No significant differences were observed between the two source versions in the overall number and distribution of flawed items.
With a few exceptions, the number of translation errors in the field trial national versions translated from the source materials remained acceptable. Only nine participants had more than 10 per cent flawed items, compared with the average of 7.6 per cent observed in countries that used one source version with only small national adaptations.
The data seem to support the hypothesis that double translation from both (rather than one of the) source versions resulted in better translations, with a lower incidence of flaws. Double translation from one source with extensive cross-checks against the other source was an effective alternative procedure.
Chapter 6

Field Operations
Nancy Caldwell and Jan Lokan
Overview of Roles and Responsibilities
The study was implemented in each country by a National Project Manager (NPM), who carried out the procedures prepared by the Consortium. To implement the assessment in schools, the NPMs were assisted by School Co-ordinators and Test Administrators. Each NPM typically had several assistants, working from a base location that is referred to throughout this report as a 'national centre'.
National Project Managers
National Project Managers (NPMs) were responsible for implementing the project within their own country. They selected the school sample, ensured that schools co-operated, and then selected the student sample from a list of eligible students provided by each school (the Student Listing Form). NPMs could either follow strictly defined international procedures for selecting the school sample and then submit complete reports on the process to the Consortium, or provide complete lists of schools with age-eligible students and have the Consortium carry out the sampling.
NPMs were also given a choice of two methods for selecting the student sample (see Chapter 4): (i) using the PISA student sampling software prepared by the Consortium, or (ii) selecting a random start number and then calculating the sampling interval following the instructions in the Sampling Manual. A complete list was prepared for each school using the Student Tracking Form, which served as the central administration document for the study and linked students, test booklets and student questionnaires.
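The second, manual method amounts to systematic sampling from the listing. A minimal sketch is given below; the actual Sampling Manual rules (rounding of the interval, implicit stratification of the list) were more detailed than this.

```python
import random

def systematic_sample(listing, sample_size, rng=random):
    """Systematic equal-probability sample from a Student Listing Form.

    Compute the sampling interval, draw a random start within the first
    interval, then step through the list at that interval. Assumes
    sample_size <= len(listing).
    """
    interval = len(listing) / sample_size
    start = rng.uniform(0, interval)
    return [listing[int(start + k * interval)] for k in range(sample_size)]
```

For a school listing 100 eligible students and a target sample of 10, the interval is 10; every student then has the same probability (1/10) of selection regardless of where the random start falls.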
NPMs had additional operational responsibilities both before and after the assessment. These included:
• hiring and training Test Administrators to administer tests in individual schools;
• scheduling test sessions and deploying external Test Administrators (the recommended model);
• translating and adapting test instruments, manuals, and other materials into the testing language(s);
• submitting translated documents to the Consortium for review and approval;
• sending instructions for preparing lists of eligible students to the selected schools;
• assembling test booklets according to the design and layout specified by the Consortium;
• overseeing printing of test booklets and questionnaires;
• maintaining ties with the School Quality Monitors appointed by the Consortium (monitoring activities are described in Chapter 7);
• co-ordinating the activities of Test Administrators and School Quality Monitors;
• overseeing the packing and shipping of all materials;
• overseeing the receipt of test and other materials from the schools, and marking and data entry;
• sending completed materials to the Consortium; and
• preparing the NPM report of activities and submitting it to the Consortium.
School Co-ordinators
School Co-ordinators (SCs) co-ordinated school-related activities with the national centre and the Test Administrators. Their first task was to prepare the Student Listing Form with the names of all eligible students in the school and to send it to the NPM so that the NPM could select the student sample.
Prior to the test, the SC was to:
• establish the testing date and time in consultation with the NPM;
• receive the list of sampled students on the Student Tracking Form from the NPM and update it if necessary, as well as identify students with disabilities or limited test language proficiency who could not take the test according to criteria established by the Consortium;
• receive, distribute and collect the School Questionnaire and then deliver it to the Test Administrator;
• inform school staff, students and parents of the nature of the test and the test date, and secure parental permission if required by the school or education system;
• inform the NPM and Test Administrator of any test date or time changes; and
• assist the Test Administrator with room arrangements for the test day.

On the test day, the SC was expected to ensure that the sampled students attended the test session(s). If necessary, the SC also made arrangements for a make-up session and ensured that absent students attended it.
Test Administrators
The Test Administrators (TAs) were primarily responsible for administering the PISA test fairly, impartially and uniformly, in accordance with international standards and PISA procedures. To maintain fairness, a TA could not be the reading, mathematics or science teacher of the students being assessed, and it was preferred that TAs not be staff members at any participating school.
Prior to the test date, TAs were trained by national centres. Training included a thorough review of the Test Administrator's Manual (see next section) and the script to be followed during the administration of the test and questionnaire. Additional responsibilities included:
• ensuring receipt of the testing materials from the NPM and maintaining their security;
• co-operating fully with the SC;
• contacting the SC one to two weeks prior to the test to confirm plans;
• completing final arrangements on the test day;
• conducting a make-up session, if needed, in consultation with the SC;
• completing the Student Tracking Form and the Assessment Session Report Form (a form designed to summarise session times, student attendance, any disturbance to the session, etc.);
• ensuring that the number of tests and questionnaires collected from students tallied with the number sent to the school;
• obtaining the School Questionnaire from the SC; and
• sending the School Questionnaire and all test materials (both completed and not completed) to the NPM after the testing was carried out.
Documentation
NPMs were given comprehensive procedural manuals for each major component of the assessment.
• The National Project Manager's Manual provided detailed information about duties and responsibilities, including general information about PISA; field operations and roles and responsibilities of the NPM, the SC and the TA; translating the manuals and test instruments; selecting the student sample; assembling and shipping materials; data marking and entry; and documentation to be submitted by the NPM to the Consortium.
• The School Co-ordinator's Manual described, in detail, the activities and responsibilities of the SC. Information was provided on all aspects, from selecting a date for the assessment to arranging for a make-up session and storing the copies of the assessment forms.
• The Test Administrator's Manual not only provided a comprehensive description of the duties and responsibilities of the TA, from
attending the TA training to conducting a make-up session, but also included the script to be read during the test as well as a Return Shipment Form (a form designed to track materials to and from the school).
• The Sampling Manual provided detailed instructions regarding the selection of the school and student samples and the reports that had to be submitted to the Consortium to document each step in the process of selecting these samples.

These manuals also included checklists and timetables for easy reference. NPMs were required to fill in country-specific information, in addition to adding specific dates, in both the School Co-ordinator's Manual and the Test Administrator's Manual.
KeyQuest®, a sampling software package, was prepared by the Consortium and given to NPMs as one method for selecting the student sample from the Student Listing Form. About half of the NPMs did, in fact, use this software; others used their own sampling software (which had to be approved by the Consortium).
Materials Preparation
Assembling Test Booklets and Questionnaires
As described in Chapter 2, nine different test booklets had to be assembled, with clusters of test items arranged according to the balanced incomplete block design specified by the Consortium. Test items were presented in units (stimulus material and two or more items relating to the stimulus), and each cluster contained several units. Test units and questionnaire items were initially sent to NPMs several months before the testing dates, to enable translation to begin. The allocation of units to clusters, and of clusters to booklets, was provided a few weeks later, together with detailed instructions to NPMs about how to assemble their translated or adapted clusters into booklets.
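The kind of rotation such a design involves can be illustrated with a purely hypothetical sketch. The cluster labels and the three-clusters-per-booklet layout below are assumptions for illustration, not the actual PISA 2000 design, which is specified in Chapter 2.

```python
def rotated_booklets(clusters, per_booklet=3):
    """Hypothetical cyclic rotation assigning clusters to booklets.

    With n clusters and n booklets, every cluster appears in `per_booklet`
    booklets and once in each position (first, middle, last) -- the kind of
    balance a balanced incomplete block design is built to achieve, so that
    no cluster is always seen by tired students at the end of a session.
    """
    n = len(clusters)
    return [[clusters[(b + p) % n] for p in range(per_booklet)]
            for b in range(n)]
```

With nine clusters, booklet 1 would carry clusters 1-2-3, booklet 2 clusters 2-3-4, and so on, wrapping around at the end of the list.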
For reference, master hard copies of all booklets were provided to NPMs, and master copies in both English and French were also available through a secure website. NPMs were encouraged to use the cover design provided by the OECD (both black-and-white and coloured versions were made available). In formatting translated or adapted test booklets, they had to follow the layout of the English master copies as far as possible, including the allocation of items to pages. A slightly smaller or larger font than in the master copy was permitted if this was necessary to ensure the same page set-up as that of the source version.
NPMs were required to submit copies of their assembled test booklets for verification by the Consortium before printing the booklets.
The Student Questionnaire contained one, two, or three modules, according to whether the international options of Cross-Curricular Competency (CCC) and Computer Familiarity or Information Technology (IT) were being added to the core component. About half the countries chose to administer the IT component and just over three-quarters used the CCC component. The core component had to be presented first in the questionnaire booklet. If both international options were used, the IT module was to be placed ahead of the CCC module.
NPMs were permitted to add questions of national interest as ‘national options’ to the questionnaires. Proposals and text for these had to be submitted to the Consortium for approval. It was recommended that, if more than two pages of additional questions were proposed and the tests and questionnaires were being administered in a single session, the additional material should be placed at the end of the questionnaire. The Student Questionnaire was modified more often than the School Questionnaire.
NPMs were required to submit copies of their assembled questionnaires for verification by the Consortium prior to printing.
Printing Test Booklets and Questionnaires
Printing had to be done such that the content of the instruments was constantly secure. Given that the test booklet was administered in two parts to give students a brief rest after the first hour, the second half of the booklet had to be distinguishable from the first half so that TAs could see if a student was working in the wrong part. It was recommended that this be done either by having a shaded strip printed across the top (or down the outside edge) of pages in the second half of the booklet, or by having the
second half of the booklet sealed with a sticker.
If the Student Questionnaire was to be included in the same booklet as the test items, then the questionnaire section had to be indicated in a different way from the two parts of the test booklet. Although this complicated the distribution of materials, it was simpler to print the questionnaire in a separate booklet, and most countries did so.
Packaging and Shipping Materials
Regardless of how materials were packaged and shipped, the following needed to be sent either to the TA or to the school:
• test booklets and Student Questionnaires for the number of students sampled;
• Student Tracking Form;
• two copies of the Assessment Session Report Form;
• Packing Form;
• Return Shipment Form;
• additional materials, e.g., rulers and calculators, as per local circumstances; and
• additional School and Student Questionnaires and a bundle of extra test booklets.
Of the nine separate test booklets, one could be pre-allocated to each student by the KeyQuest® software from a random starting point in each school. KeyQuest® could then be used to generate the school’s Student Tracking Form, which contained the number of the allocated booklet alongside each sampled student’s name. Instructions were also provided for carrying out this step manually, starting with Booklet 1 for the first student in the first school, and assigning booklets in order up to the designated number of students in the sample. With 35 students (the recommended number of students to sample per school), the last student in the first school would be assigned Booklet 8. At the second school, the first student would be assigned Booklet 9, the second, Booklet 1, and so on. In this way, the booklets would be rotated so that each would be used more or less equally, as they also would be if allocated from a random starting point.
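The manual rotation rule above amounts to a single booklet counter cycling continuously across schools. A minimal sketch in Python (illustrative only; this is not KeyQuest® code, and the function name is an assumption):

```python
# Sketch of the manual booklet-rotation rule: Booklets 1-9 are assigned
# in a continuous cycle across all sampled students, so each booklet is
# used more or less equally often.
def assign_booklets(n_schools, students_per_school=35, n_booklets=9):
    assignments = {}
    counter = 0  # runs on continuously from school to school
    for school in range(1, n_schools + 1):
        for student in range(1, students_per_school + 1):
            assignments[(school, student)] = (counter % n_booklets) + 1
            counter += 1
    return assignments

a = assign_booklets(2)
# As described above: the last student of the first school gets Booklet 8,
# the first student of the second school gets Booklet 9, the second Booklet 1.
print(a[(1, 35)], a[(2, 1)], a[(2, 2)])  # 8 9 1
```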
It was recommended that labels be printed, each with a student identification number and the test booklet number allocated to that identification, plus the student’s name if this was an acceptable procedure within the country. Two or three copies of each student’s label could be printed, and used to identify the test booklet, the questionnaire, and a packing envelope if used.
NPMs were allowed some flexibility in how the materials were packaged and distributed, depending on national circumstances. It was specified, however, that the test booklets for a school be packaged so that they remained secure, possibly by wrapping them in clear plastic and then heat-sealing the package, or by sealing each booklet in a labelled envelope.
Three scenarios, summarised here, were described as illustrative of acceptable approaches to packaging and shipping the assessment materials:
• Country A − all assessment materials shipped directly to the schools (with fax-back forms provided for SCs to acknowledge receipt of materials); school staff (not teachers of the students in the assessment) used to conduct the testing sessions; test booklets and Student Questionnaires printed separately; materials assigned to students before packaging for shipment; each test booklet and questionnaire labelled, and then sealed in an identically labelled envelope for shipping to the school.
• Country B − all assessment materials shipped directly to the schools (with fax-back forms provided for SCs to acknowledge receipt of materials); test sessions conducted by TAs employed by the national centre; test booklets and Student Questionnaires printed separately, and labelled and packaged in separately bound bundles assembled in Student Tracking Form order; on completion of the assessment, each student places and seals his/her test booklet and questionnaire in a labelled envelope provided to protect student confidentiality within the school.
• Country C − test sessions conducted by TAs employed by the national centre, with assessment materials for the scheduled schools shipped directly to the TAs; test and Student Questionnaires printed in the one booklet, with a black bar across the top of each page in the second part of the test; bundles of 35 booklets loosely sealed in plastic, so that their number can be checked without opening the packages; school packages opened immediately prior to the session by the TA and identification number and name labels
affixed according to the assignment of booklets pre-recorded on the Student Tracking Form at the national centre.
Receipt of Materials at the National Centre After Testing
It was recommended that the national centre establish a database of schools before testing began, with fields for recording shipment of materials to and from schools, tallies of materials sent and tallies of completed and unused booklets returned, and for various steps in processing booklets after the testing. KeyQuest® could be used for this purpose if desired. TAs recorded each student’s participation status (present/absent) on the Student Tracking Form for each of the two test parts and for the Student Questionnaire. This information was to be checked at the national centre during the unpacking step.
Processing Tests and Questionnaires After Testing
This section describes PISA’s marking procedures, including multiple marking, and also makes brief reference to pre-coding of responses to a few items in the Student Questionnaire. Because about one-third of the mathematics and science items and more than 40 per cent of the reading items required students to write in their answers, the answers had to be evaluated and marked (scored). This was a complex operation, as booklets had to be randomly assigned to markers and, for the minimum recommended sample size per country of 4 500 students, more than 120 000 responses had to be evaluated. Each of the nine booklets had between 20 and 31 items requiring marking.
It is crucial for comparability of results in a study such as PISA that students’ responses be scored uniformly from marker to marker and from country to country. Comprehensive criteria for marking, including many examples of acceptable and not acceptable responses, were prepared by the Consortium and provided to NPMs in Marking Guides for each of reading, mathematics and science.
Steps in the Marking Process
In setting up the marking of students’ responses to open-ended items, NPMs had to carry out or oversee several steps:
• adapt or translate the Marking Guides as needed;
• recruit and train markers;
• locate suitable local examples of responses to use in training and practice;
• organise booklets as they were returned from schools;
• select booklets for multiple marking;
• single mark booklets according to the international design;
• multiple mark a selected sub-sample of booklets once the single marking was completed; and
• submit a sub-sample of booklets for the Inter-Country Rater Reliability Study (see Chapter 10).
Detailed instructions for each step were provided in the NPM Manual. Key aspects of the process are included here.
International training
Representatives from each national centre were required to attend two international marker training sessions: one immediately prior to the field trial and one immediately prior to the main study. At the training sessions, Consortium staff familiarised national centre staff with the Marking Guides and their interpretation.
Staffing
NPMs were responsible for recruiting appropriately qualified people to carry out the single and multiple marking of the test booklets. In some countries, pools of experienced markers from other projects could be called on. It was not necessary for markers to have high-level academic qualifications, but they needed to have a good understanding of either mid-secondary level mathematics and science or the language of the test, and to be familiar with the ways in which secondary-level students express themselves. Teachers on leave, recently retired teachers and senior teacher trainees were all considered to be potentially suitable markers. An important factor in recruiting markers was that they could commit their time to the project for the duration of the marking, which was expected to take up to two months.
The Consortium provided a Marker Recruitment Kit to assist NPMs in screening applicants. These materials were similar in nature to the Marking Guides, but were much briefer. They were designed so that applicants who were considered to be potentially suitable could be given a brief training session, after which they marked some student responses. Guidelines for assessing the results of this exercise were supplied. The materials also provided applicants with the opportunity to assess their own suitability for the task.
The number of markers required was governed by the design for multiple marking (described in a later section). For the main study, markers were required in multiples of eight, with 16 as the recommended number for reading, and eight markers, who could mark both mathematics and science, recommended for those areas. These numbers of markers were considered to be adequate for countries testing between 4 500 (the minimum number required) and 6 000 students to meet the timeline of submitting their data within three months of testing.
For larger numbers of students, or in cases where reading markers could not be obtained in multiples of eight, NPMs could prepare their own design for reading marking and submit it to the Consortium for approval; only a handful of countries did this. For NPMs who were unable to recruit people who could mark both mathematics and science, an alternative design involving four mathematics markers and four science markers was provided (but this design had the drawback that it yielded less comprehensive information from the multiple marking). Given that several weeks were required to complete the marking, it was recommended that at least two back-up reading markers and one back-up mathematics/science marker be trained and included in at least some of the marking sessions.
The marking process was complex enough to require a full-time overall supervisor of activities who was familiar with the logistical aspects of the marking design, the procedures for checking marker reliability, the marking schedules and also the content of the tests and Marking Guides.
NPMs were also required to designate persons with subject-matter expertise, familiarity with the PISA tests and, if possible, experience in marking student responses to open-ended items, to act as ‘table leaders’ during the marking. Good table leaders were essential to the quality of the marking, as their main role was to monitor markers’ consistency in applying the marking criteria. They also assisted with the flow of booklets, and fielded and resolved queries about the Marking Guide and about particular student responses in relation to the guide, consulting the supervisor as necessary when queries could not be resolved. The supervisor was then responsible for checking such queries with the Consortium.
Depending on their level of experience and on the design followed for mathematics/science, between two and four table leaders were recommended for reading and either one or two for mathematics/science. Table leaders were expected to participate in the actual marking and to spend extra time monitoring consistency.
Several persons were needed to unpack, check, and assemble booklets into labelled bundles so that markers could respect the specified design for randomly allocating sets of booklets to markers.
Confidentiality forms
Before seeing or receiving any copies of PISA test materials, prospective markers were required to sign a Confidentiality Form, obligating them not to disclose the content of the PISA tests beyond the groups of markers and trainers with whom they would be working.
National training
Anyone who marked the PISA main survey test booklets had to participate in specific training sessions, regardless of whether they had had related experience or had been involved in the PISA field trial marking. To assist NPMs in carrying out the training, the Consortium prepared training materials in addition to the detailed Marking Guides. Training within a country could be carried out by the NPM or by one or more knowledgeable persons appointed by the NPM. Subject-matter knowledge was important for the trainer, as was an understanding of the procedures, which usually meant that more than one person was involved in leading the training.
Training sessions were organised as a function of how marking was done. The recommendation was to mark by cluster, completing the marking
of each item separately within a cluster before moving to the next item, and completing one cluster before moving to the next. The recommended allocation of booklets to markers assumed marking by cluster, though an alternative design involving marking by booklet was also provided.
If marking was done by cluster, then markers were trained by cluster for the nine reading clusters and the four clusters for each of mathematics and science. If prospective markers had done as recommended and worked through the test booklets and read through the Marking Guides in advance, training for each cluster took about half a day. During a training session, the trainer reviewed the Marking Guide for a cluster of units with the markers, then had the markers assign marks to some sample items for which the appropriate marks had been supplied by the Consortium. The trainer reviewed the results with the group, allowing time for discussion, querying and clarification of the reasons for the pre-assigned marks. Trainees then proceeded to mark independently some local examples that had been carefully selected by the supervisor of marking in conjunction with national centre staff.
Training was more difficult if marking was done by booklet. Each booklet contained either two or three 30-minute reading clusters and from two to four 15-minute mathematics/science clusters (except Booklet 7, which had four reading clusters). Thus, markers had to be trained in several clusters before they could begin marking a booklet.
It was recommended that prospective markers be informed at the beginning of training that they would be expected to apply the Marking Guides with a high level of consistency, and that reliability checks would be made frequently by table leaders and the overall supervisor as part of the marking process.
Ideally, table leaders were trained before the larger groups of markers, since they needed to be thoroughly familiar with both the test items and the Marking Guides. The marking supervisor explained these to the point where the table leaders could mark and reach a consensus on the selected local examples to be used later with the larger group of trainees. They also participated in the training sessions with the rest of the markers, partly to strengthen their own knowledge of the Marking Guides and partly to assist the supervisor in discussions with the trainees about the pre-agreed marks for the sample items.
Table leaders received additional training in the procedures for monitoring the consistency with which markers applied the criteria.
Length of marking sessions
Marking responses to open-ended items is mentally demanding, requiring a level of concentration that cannot be maintained for long periods of time. It was therefore recommended that markers work for no more than six hours per day on actual marking, and take two or three breaks for coffee and lunch. Table leaders needed to work longer on most days so that they had adequate time for their monitoring activities.
Logistics Prior to Marking
Sorting booklets
When booklets arrived back at the national centre, they were first tallied and checked against the session participation codes on the Student Tracking Form. Unused and used booklets were separated; used booklets were sorted by student identification number, if they had not been sent back in that order, and then were separated by booklet number; and school bundles were kept in school identification order, filling in sequence gaps as packages arrived. Student Tracking Forms were carefully filed in ring binders in school identification order. If the school identification number order did not correspond with the alphabetical order of school names, it was recommended that an index of school name against school identification be prepared and kept with the binders.
Because of the time frame within which countries had to have all their marking done and data submitted to the Consortium, it was usually impossible to wait for all materials to reach the national centre before beginning to mark. In order to manage the design for allocating booklets to markers, however, it was recommended to start marking only when at least half of the booklets had been returned.
Selection of booklets for multiple marking
The Technical Advisory Group decided to set aside in each country 48 each of Booklets 1 to 7 and 72 each of Booklets 8 and 9 for multiple
marking. The mathematics and science clusters were more thinly spread in Booklets 1 to 7 than in Booklets 8 and 9. The larger number of Booklets 8 and 9 was specified to have sufficient data on which to base the marker reliability analyses for mathematics and science.
The main principle in setting aside the booklets for multiple marking was that the selection needed to ensure a wide spread of schools and students across the whole sample and to be as random as possible. The simplest method for carrying out the selection was to use a ratio approach based on the expected total number of completed booklets, combined with a random starting point.
In the minimum recommended student sample of 4 500 students per country, approximately every tenth booklet of Booklets 1 to 7 and every seventh booklet of Booklets 8 and 9 needed to be set aside. Random numbers between 1 and 10 and between 1 and 7 needed to be drawn as the starting points for the selection. Depending on the actual numbers of completed booklets received, the selection ratios needed to be adjusted so that the correct numbers of each booklet were selected from the full range of participating schools.
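The ratio approach described above is ordinary systematic sampling. A minimal sketch, assuming the completed booklets of one type are held in school identification order (the function name is illustrative, not Consortium software):

```python
import random

# Systematic selection for multiple marking: take every k-th completed
# booklet of a given type, counting from a random starting point
# (k = 10 for Booklets 1 to 7 and k = 7 for Booklets 8 and 9 at the
# minimum sample size; in practice k was adjusted to the actual returns).
def select_for_multiple_marking(booklet_ids, k, rng=random):
    start = rng.randrange(k)      # random starting point in 0..k-1
    return booklet_ids[start::k]

rng = random.Random(2000)
ids = list(range(1, 501))         # ~500 completed copies of one booklet type
picked = select_for_multiple_marking(ids, 10, rng)
print(len(picked))                # 50 booklets set aside
```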
Booklets for single marking
Only one marker was required to mark all booklets remaining after those for multiple marking had been set aside. For the minimum required sample size of 4 500, there would have been approximately 450 of each of Booklets 1 to 7 and 430 of each of Booklets 8 and 9 in this category.
Some items requiring marking did not need to be included in the multiple marking. These items, in the booklets set aside for multiple marking, were marked by the last marker in the multiple-marking process.
How Marks Were Shown
A string of small code numbers, corresponding to the possible codes for the item as delineated in the relevant Marking Guide (including the code for ‘not applicable’ to allow for a misprinted item), appeared in the upper right-hand side of each item requiring judgement in the test booklets. For booklets being processed by a single marker, the mark assigned was indicated directly in the booklet by circling the appropriate code number alongside the item.
Tailored marking record sheets were prepared for each booklet for the multiple marking and used by all but the last marker, so that each marker undertaking multiple marking did not know which marks other markers had assigned.
For the reading tests, item codes were often just 0, 1, 9 and ‘n’, indicating incorrect, correct, missing and ‘not applicable’, respectively. Provision was made for some of the open-ended items to be marked as partially correct, usually with ‘2’ as fully correct and ‘1’ as partially correct, but occasionally with three degrees of correctness indicated by codes of ‘1’, ‘2’ and ‘3’.
For the mathematics and science tests, a two-digit coding scheme was adopted for the items requiring constructed responses. The first digit represented the ‘degree of correctness’ mark, as in reading; the second indicated the content of the response or the type of solution method used by the student. Two-digit codes were originally proposed by Norway for the Third International Mathematics and Science Study (TIMSS) and were adopted in PISA because of their potential for use in studies of student learning and thinking.
Marker Identification Numbers
Marker identification numbers were assigned according to a standard three-digit format specified by the Consortium. The first digit had to show whether the marker was a reading, mathematics or science marker (or a mathematics/science marker), and the second and third digits had to uniquely identify the markers within their set. Marker identification numbers were used for two purposes: implementing the design for allocating booklets to markers, and monitoring marker consistency in the multiple-marking exercises.
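As an illustration only (the report fixes the structure of these IDs but not the actual digit assignments, so the mapping below is an assumption), such identifiers could be generated as:

```python
# Hypothetical three-digit marker IDs: first digit encodes the marking
# domain, last two digits uniquely number the marker within that set.
# The specific digit-to-domain mapping below is assumed, not taken from
# the report.
DOMAIN_DIGIT = {"reading": 1, "mathematics": 2, "science": 3, "maths/science": 4}

def marker_id(domain, number):
    if not 1 <= number <= 99:
        raise ValueError("marker number must fit in two digits")
    return DOMAIN_DIGIT[domain] * 100 + number

print(marker_id("reading", 16))       # 116
print(marker_id("maths/science", 8))  # 408
```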
Design for Allocating Booklets to Markers
Reading
If marking was done by cluster, each reading marker needed to handle three of the nine booklet types at a time, because each cluster was featured in three booklets, except for the ninth cluster, which appeared in only two of the booklets. For example, reading Cluster 1 occurred in Booklets 1, 5 and 7, which therefore had to be marked before any other items in another cluster. Moreover, since marking was
done item by item, each item was marked across the three booklets before the next item was marked.
A design to ensure the random allocation of booklets to reading markers was prepared, based on the recommended number of 16 markers and the minimum sample size of 4 500 students from 150 schools.1 Booklets were to be sorted by student identification within schools. With 150 schools and 16 markers, each marker had to mark a cluster within a booklet from eight or nine schools (150 ÷ 16 ≅ 9). Figure 6 shows how booklets needed to be assigned to markers for the reading single marking. Further explanation of the information in this table is presented below.
According to this design, Cluster 1 in subset 1 (schools 1 to 9) was to be marked by Reading Marker 1 (M1 in Figure 6), Cluster 1 in subset 2 (schools 10 to 18) was to be marked by Reading Marker 2 (M2 in Figure 6), and so on. For Cluster 2, Reading Marker 1 was to mark all booklets from subset 2 (schools 10 to 18) and Reading Marker 2 was to mark all booklets from subset 3 (schools 19 to 27). Subset 1 of Cluster 2 (schools 1 to 9) was to be marked by Reading Marker 16.
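The pattern in Figure 6 is a simple cyclic shift: moving from one cluster to the next shifts each marker one school subset to the right. A minimal sketch of the rule (the function name is illustrative):

```python
# Marker assignment for single marking of reading (Figure 6): for
# cluster c (1-9) and school subset s (1-16), the marker row shifts one
# place per cluster, wrapping around modulo 16.
def reading_marker(cluster, subset, n_markers=16):
    return ((subset - cluster) % n_markers) + 1

print(reading_marker(1, 1))  # 1  (M1 marks Cluster 1 in subset 1)
print(reading_marker(2, 1))  # 16 (M16 marks Cluster 2 in subset 1)
print(reading_marker(2, 2))  # 1  (M1 marks Cluster 2 in subset 2)
```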
If booklets from all participating schools were available before the marking began, the following steps would be involved in implementing the design:
Subsets of Schools
Cluster Booklets 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
R1 1,5,7 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 M16
R2 1,2,6 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15
R3 2,3,7 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14
R4 1,3,4 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13
R5 2,4,5 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
R6 3,5,6 M12 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11
R7 4,6,7 M11 M12 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
R8 7,8,9 M10 M11 M12 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8 M9
R9 8,9 M9 M10 M11 M12 M13 M14 M15 M16 M1 M2 M3 M4 M5 M6 M7 M8
Figure 6: Allocation of the Booklets for Single Marking of Reading by Cluster
Step 1: Set aside booklets for multiple marking and then divide the remaining booklets into school subsets as above (subset 1: schools 1 to 9; subset 2: schools 10 to 18, etc.), to achieve 16 subsets of schools.
Step 2: Assuming that marking begins with Cluster 1:
Marker 1 takes Booklets 1, 5 and 7 for School Subset 1;
Marker 2 takes Booklets 1, 5 and 7 for School Subset 2; etc.; until
Marker 16 takes Booklets 1, 5 and 7 for School Subset 16.
Step 3: Markers mark all of the first Cluster 1 item requiring marking in the booklets that they have (Booklets 1, 5 and 7).
Step 4: The second Cluster 1 item is marked in all three booklet types, followed by the third Cluster 1 item, etc., until all Cluster 1 items are marked.
Step 5: For Cluster 2, as per the row of the table in Figure 6 corresponding to R2 in the left-most column, each marker is allocated a subset of schools different from their subset for Cluster 1. Marking proceeds item by item within the cluster.
Step 6: For the remaining clusters, the rows corresponding to R3, R4, etc., in the table are followed in succession.
1 Countries with more or fewer than 150 schools or a different number of markers had to adjust the size of the school subsets accordingly.
As a result of this procedure, the booklets from each subset of schools are processed by nine different reading markers, and each student’s booklet is marked by three different reading markers (except Booklet 7, which has four reading clusters, and Booklets 8 and 9, which have only two reading clusters). Spreading booklets among markers in this way minimises the effects of any systematic leniency or harshness in marking.
In practice, most countries would not have had completed test booklets back from all their sampled schools before marking needed to begin. NPMs were encouraged to organise the marking in two waves, so that it could begin after materials were received back from one half of their schools. Schools would not have been able to be assigned to school sets for marking exactly in their school identification order, but rather by identification order combined with when their materials were received and processed at the national centre.
Mathematics and science
With eight mathematics/science markers who could mark in both areas, the booklets needed to be assigned to markers according to Figure 7 for marking by cluster. Here, the subsets of schools would contain twice the number of schools as for the reading marking, because schools were shared among eight and not 16 markers. This could be done by combining subsets 1 and 2 from reading into subset 1 for mathematics/science, and so on, unless the mathematics/science marking was done before the reading, in which case the reading subsets could be formed by halving the mathematics/science subsets. (Logistics were complex because all of the booklets that contained mathematics or science items also contained reading items, and had to be shared with the reading markers.)
It was recommended that the marking in one stage be completed before marking in the other began. Stage 2 could be undertaken before Stage 1, if this helped with the flow of booklets.
If separate markers were needed for mathematics and science, the single marking could be accomplished with the design shown in Figure 8, in which case schools needed to be divided into only four subsets.
Subsets of Schools
Stage Cluster Booklet 1 2 3 4 5 6 7 8
1: Mathematics M1 1,9 M1 M2 M3 M4 M5 M6 M7 M8
M2 1,5,8 M7 M8 M1 M2 M3 M4 M5 M6
M3 3,5,9 M5 M6 M7 M8 M1 M2 M3 M4
M4 3,8 M3 M4 M5 M6 M7 M8 M1 M2
2: Science S1 2,8 M1 M2 M3 M4 M5 M6 M7 M8
S2 2,6,9 M7 M8 M1 M2 M3 M4 M5 M6
S3 4,6,8 M5 M6 M7 M8 M1 M2 M3 M4
S4 4,9 M3 M4 M5 M6 M7 M8 M1 M2
Figure 7: Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Common Markers
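Figure 7 follows the same modular idea as the reading design, except that within each stage the marker row shifts two places per cluster. A minimal sketch of that rule (the function name is illustrative):

```python
# Marker assignment for common mathematics/science markers (Figure 7):
# within a stage, cluster c (1-4) in school subset s (1-8) goes to the
# marker obtained by shifting two places per cluster, modulo 8.
def ms_marker(cluster, subset, n_markers=8):
    return ((subset - 1 - 2 * (cluster - 1)) % n_markers) + 1

print(ms_marker(1, 1))  # 1 (M1 marks cluster M1/S1 in subset 1)
print(ms_marker(2, 1))  # 7 (M7 marks cluster M2/S2 in subset 1)
print(ms_marker(4, 1))  # 3 (M3 marks cluster M4/S4 in subset 1)
```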
Mathematics Science Subsets of Schools
Cluster Booklet Cluster Booklet 1 2 3 4
M1 1,9 S1 2,8 M1 M2 M3 M4
M2 1,5,8 S2 2,6,9 M2 M3 M4 M1
M3 3,5,9 S3 4,6,8 M3 M4 M1 M2
M4 3,8 S4 4,9 M4 M1 M2 M3
Figure 8: Booklet Allocation for Single Marking of Mathematics and Science by Cluster, Separate Markers
SE booklet (Booklet 0)
Countries using the shorter, special-purpose SE booklet (numbered as Booklet 0) were advised to process this separately from the remaining booklets. Small numbers of students used this booklet, only a few items required marking, and the items were not arranged in clusters. NPMs were cautioned that, as for the main marking, SE booklets needed to be allocated to several markers to ensure uniform application of the marking criteria.
Managing the Actual Marking
Booklet flow
To facilitate the flow of booklets, it was important to have ample table surfaces on which to place and arrange them by type and school subset. The bundles needed to be clearly labelled. For this purpose, it was recommended that each bundle of booklets be identified by a batch header for each booklet type (Booklets 1 to 9), with spaces for the number of booklets and the school identifications represented in the bundle to be written in. In addition, each header sheet was to be pre-printed with a list of the clusters in the booklet, with columns alongside in which the date and time, the marker’s name and identification, and the table leader’s initials could be entered as the bundle was marked and checked.
Separating the reading and mathematics/science marking
It was recommended that the reading and mathematics/science marking be done at least partly at different times (for example, mathematics/science marking could start a week or two ahead of the reading marking). Eight of the nine booklets contained material in reading together with material in mathematics and/or science. Except for Booklet 7, which contained only reading, it could have been difficult to maintain an efficient flow of booklets through the marking process if all marking in all three domains were done simultaneously.
Familiarising markers with the marking design
The relevant design for allocating booklets to markers was explained either during the marker training session or at the beginning of the first marking session (or both). The marking supervisor was responsible for ensuring that markers adhered to the design, and used clerical assistants if needed. Markers could better understand the process if each was provided with a card indicating the bundles of booklets to be taken and in which order.
Consulting table leaders
During the initial training, practice, and review, it was expected that coding issues would be discussed openly until markers understood the rationale for the marking criteria (or reached consensus where the Marking Guide was incomplete). Markers were advised to work quietly, referring queries to their table leader rather than to their neighbours. If a particular query arose often, the table leader was advised to discuss it with the rest of the group.
Markers were not permitted to consult other markers or table leaders during the additional practice exercises (see next subsection), which were used to gauge whether all or some markers needed more training and practice, or during the multiple marking.
Monitoring single marking
The steps described here represent the minimum level of monitoring activities required. Countries wishing to implement more extensive monitoring procedures during single marking were encouraged to do so.
The supervisor, assisted by table leaders, was advised to collect markers' practice papers after each cluster practice session and to tabulate the marks assigned. These were then to be compared with the pre-agreed marks: each matching mark was considered a hit and each discrepant mark was considered a miss. To reflect an adequate standard of reliability, the ratio of hits to the total of hits plus misses needed to be 0.85 or more. In mathematics and science, this reliability was to be assessed on the first digit of the two-digit codes. A ratio of less than 0.85, especially if lower than 0.80, was to be taken as indicating that more practice was needed, and possibly also more training.
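The hit/miss tally described above can be sketched in code. This is a minimal illustration, not part of the PISA procedures: the function name and the sample code lists are invented, and the only rule taken from the text is that mathematics and science agreement is judged on the first digit of each two-digit code.

```python
# Illustrative sketch of the marker-reliability check described above.
# Names and example data are invented; the 0.85 threshold is from the text.

def hit_ratio(assigned, agreed, first_digit_only=False):
    """Proportion of a marker's practice marks matching the pre-agreed marks."""
    if first_digit_only:  # mathematics/science: compare first digit of two-digit codes
        assigned = [str(code)[0] for code in assigned]
        agreed = [str(code)[0] for code in agreed]
    hits = sum(1 for a, b in zip(assigned, agreed) if a == b)
    return hits / len(agreed)

# Reading: codes compared as given (4 hits out of 5).
reading = hit_ratio([1, 0, 2, 1, 1], [1, 0, 2, 1, 0])
# Mathematics: codes 11 and 12 agree on their first digit, so all 4 are hits.
maths = hit_ratio([11, 12, 70, 21], [12, 11, 70, 21], first_digit_only=True)

for name, ratio in (("reading", reading), ("maths", maths)):
    verdict = "adequate" if ratio >= 0.85 else "needs more practice"
    print(f"{name}: {ratio:.2f} ({verdict})")
# reading: 0.80 (needs more practice)
# maths: 1.00 (adequate)
```

The same tabulation would be repeated per marker and per cluster practice session, with ratios below 0.85 (and especially below 0.80) flagged for more practice or training.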
Table leaders played a key role during each marking session and at the end of each day, by spot-checking a sample of booklets or items that had already been marked to identify problems for discussion with individual markers or with the wider group, as appropriate. All booklets that had not been set aside for multiple marking were candidates for this spot-checking. It was recommended that, if there were indications from the practice sessions that one or more particular markers might be experiencing problems in using the Marking Guide consistently, then more of those markers' booklets should be included in the checking.
Table leaders were advised to review the results of the spot-checking with the markers at the beginning of the next day's marking. This was regarded primarily as a mentoring activity, but NPMs were advised to keep in contact with table leaders and the marking supervisor in case there were individual markers who did not meet criteria of adequate reliability and would need to be removed from the pool.
Table leaders were to initial and date the header sheet of each batch of booklets for which they had carried out spot-checking. Some items/booklets from each batch and each marker had to be checked.
Multiple Marking
For PISA 2000, multiple marking meant that four separate markers marked all short-response and open constructed-response items in 48 copies of each of Booklets 1 to 7 and 72 copies of each of Booklets 8 and 9. Multiple marking was done at or towards the end of the marking period, after markers had familiarised themselves with and were confident in using the Marking Guides.
As noted earlier, the first three markers of the selected booklets circled codes on separate record sheets, tailored to booklet type and subject area (reading, mathematics or science), using one page per student. The marking supervisor checked that markers correctly entered student identification numbers and their own identification number on the sheets, which was crucial to data quality. Also as noted earlier, the SE booklet was not included in the multiple marking.
In a country where booklets were provided in more than one language, multiple marking was required only in the language used for the majority of the PISA booklets, though it was encouraged in more than one language if possible. If two languages were used equally, the NPM could choose either language.
While markers would have been thoroughly familiar with the Marking Guides by this time, they may have most recently marked a different booklet from those allocated to them for multiple marking. For this reason, they needed time to re-read the relevant Marking Guide before beginning the marking. It was recommended that at least half a day be used for markers to refresh their familiarity with the guides and to look again at the additional practice material before proceeding with the multiple marking.
As in the single marking, marking was to be done item by item. For manageability, all items within a booklet were to be marked before moving to the next booklet. Rather than marking by cluster across several booklet types, it was considered that markers would by this time be experienced enough in applying the marking criteria that marking by booklet would be unlikely to detract from the quality of the data.
Booklet allocation design

The specified multiple-marking design for reading, shown in Figure 9, assumed 16 markers with identification numbers 201 to 216. The importance of following the design exactly as specified was stressed, as it provides balanced links between clusters and markers.

Figure 9 shows the 16 markers grouped into eight pairs, with Group 1 comprising the first two markers (201 and 202), Group 2 the next two (203 and 204), and so on. The design involved two steps, with the booklets divided into two sets: Booklets 1 to 4 made up one set, and Booklets 5 to 9 the second. The four markings were to be carried out by allocating booklets to the four markers shown in the right-hand column of the table for each booklet.
Step   Booklet   Marker Groups   Marker Identifications
1      1         1 and 2         201, 202, 203, 204
       2         3 and 4         205, 206, 207, 208
       3         5 and 6         209, 210, 211, 212
       4         7 and 8         213, 214, 215, 216
2      7         2 and 3         203, 204, 205, 206
       5/6       4 and 5         207, 208, 209, 210
       8         6 and 7         211, 212, 213, 214
       9         8 and 1         215, 216, 201, 202
Figure 9: Booklet Allocation to Markers for Multiple Marking of Reading
In this scenario, with all 16 markers working, Booklets 1 to 4 were to be marked at the same time in the first step. The 48 copies of Booklet 1, for example, were to be divided into four bundles of 12 and rotated among markers 201, 202, 203 and 204, so that each marker eventually would have marked all 48 of this booklet. The same pattern was to be followed for Booklets 2, 3 and 4. Each booklet had approximately the same number of items requiring marking and the same number of booklets (48) to be marked.
After Booklets 1 to 4 had been put through the multiple-marking process, the pairs of markers were to regroup (but remain in their pairs) and follow the allocation in the second half of Figure 9. That is, markers 203, 204, 205 and 206 were to mark Booklet 7, markers 207, 208, 209 and 210 were to mark Booklets 5 and 6, and so on for the remaining booklets. Booklets 5 and 6 contained fewer items requiring marking than Booklet 7, and there were more Booklets 8 and 9 to process. The closest to a balanced workload was achieved by regarding Booklets 5 and 6 as equivalent to one booklet for this exercise.
If only eight markers were available, the design could be applied by using the group designations in Figure 9 as marker identification numbers. However, four steps, not two, were needed to achieve four markings per booklet. The third and fourth steps could be achieved by starting the third step with Booklet 1 marked by markers 3 and 4 and continuing the pattern in a similar way as in the figure, and by starting the fourth step with Booklet 7 marked by markers 4 and 5.
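The pairing rule behind Figure 9 (Group g comprises markers 2g - 1 and 2g, offset by 200, and each booklet goes to two adjacent groups) can be expressed algorithmically. The sketch below is illustrative only; the function name and the dictionary layout are invented, but the resulting allocations match those stated in the text (e.g. markers 203 to 206 marking Booklet 7 in the second step).

```python
# Illustrative reconstruction of the Figure 9 allocation: 16 markers in
# 8 pairs; each booklet (or booklet set) is marked by two adjacent groups.

def markers_for_groups(g1, g2):
    """Marker IDs (201-216) for a pair of adjacent groups (1-8)."""
    ids = lambda g: [200 + 2 * g - 1, 200 + 2 * g]
    return ids(g1) + ids(g2)

# Step 1: Booklets 1-4 take groups (1,2), (3,4), (5,6), (7,8).
step1 = {b: markers_for_groups(2 * k + 1, 2 * k + 2)
         for k, b in enumerate(["1", "2", "3", "4"])}
# Step 2: the group pairing shifts by one, wrapping around.
step2 = {b: markers_for_groups((2 * k + 1) % 8 + 1, (2 * k + 2) % 8 + 1)
         for k, b in enumerate(["7", "5/6", "8", "9"])}

print(step1["1"])    # [201, 202, 203, 204]
print(step2["9"])    # [215, 216, 201, 202]

# Balance check: each step uses every marker exactly once.
for step in (step1, step2):
    used = [m for markers in step.values() for m in markers]
    assert sorted(used) == list(range(201, 217))
```

The balance check at the end is the property the text stresses: every marker is occupied in every step, and across the two steps each marker pair meets two different booklet sets.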
Allocating booklets to markers for multiple marking was quite complex, and the marking supervisor had to monitor the flow of booklets throughout the process.
The multiple-marking design for mathematics and science, shown in Figure 10, assumed eight markers with identification numbers 401 to 408, who each marked both mathematics and science. This design also provided balanced links between clusters and markers. The design required four stages, which could be carried out in a different order from that shown if this helped with booklet flow, but the allocation of booklets to markers had to be retained. Since there were more items to process in stage 4, it was important for all markers to undertake this stage together.
If different markers were used for mathematics and science, a different multiple-marking design was necessary. Assuming four markers for each of mathematics and science, each booklet had to be marked by each marker. According to the scheme for assigning marker identification numbers, the markers would have had identification numbers 101, 102, 103 and 104 for mathematics, and 301, 302, 303 and 304 for science.
Five booklets contained mathematics items (Booklets 1, 3, 5, 8 and 9), and five contained science items (Booklets 2, 4, 6, 8 and 9). Because there were different numbers of booklets and very large differences in the numbers of items requiring multiple marking within them, it was important for all markers to work on the same booklet at the same time.
The design in Figure 11 was suggested, with an analogous one for science. The rotation numbers are marker identification numbers. If it helped the overall workflow, booklets could be marked in a different order from that shown in the figure.
Stage Booklet Domain Marker Identifications
1 1 Mathematics 401, 402, 403, 404
2 Science 405, 406, 407, 408
2 3 Mathematics 402, 403, 404, 405
4 Science 406, 407, 408, 401
3 5 Mathematics 403, 404, 405, 406
6 Science 407, 408, 401, 402
4 8 Maths and Science 404, 405, 406, 407
9 Maths and Science 408, 401, 402, 403
Figure 10: Booklet Allocation for Multiple Marking of Mathematics and Science, Common Markers
Booklet   First Rotation   Second Rotation   Third Rotation   Fourth Rotation
1         101              102               103              104
3         102              103               104              101
5         103              104               101              102
8         104              101               102              103
9         101              102               103              104
Figure 11: Booklet Allocation for Multiple Marking of Mathematics
Cross-national Marking
Cross-national comparability in assigning marks was explored through an inter-country rater reliability study (see Chapter 10 and Chapter 14).
Questionnaire Coding
The main coding required internationally for the Student Questionnaire was of the mother's and father's occupations and the student's occupational expectation. Four-digit International Standard Classification of Occupations (ISCO88) codes (International Labour Organisation, 1988) were assigned to these three variables. In several countries, this could be done in many ways. NPMs could use a national coding scheme with more than 100 occupational title categories, provided that this national classification could be recoded to ISCO. A national classification was preferred because relationships between occupational status and achievement could then be compared within a country using both international and national measures of occupational status.
The PISA website gave a short, clear summary of ISCO codes and occupational titles for countries to translate if they had neither a national occupational classification scheme nor access to a full translation of ISCO.
For their national options, countries may also have needed to pre-code responses to some items before data from the questionnaire were entered into the software.
Data Entry, Data Checking and File Submission
Data Entry
The Consortium provided participating countries with data entry software (KeyQuest®) that ran under Windows 95® or later, and Windows NT 4.0® or later. KeyQuest® contained the database structures for all the booklets and questionnaires used in the main survey. Variables could be added or deleted as needed for national options. Data were to be entered directly from the test booklets and questionnaires, except for the multiple-marking study, where the marks from the first three markers had been written on separate sheets.
KeyQuest® performed validation checks as data were entered. Importing facilities were also available if data had already been entered into text files, but it was strongly recommended that data be entered directly into KeyQuest® to take advantage of its many PISA-specific features.
A separate Data Entry Manual provided full details of the functionality of the KeyQuest® software and complete instructions on data entry, data management and how to carry out validity checks.
Data Checking
NPMs were responsible for ensuring that many checks of the quality of their country's data were made before the data files were submitted to the Consortium. The checking procedures required that the List of Sampled Schools and the Student Tracking Form for each school had already been accurately completed and entered into KeyQuest®. Any errors had to be corrected before the data were submitted. Copies of the cleaning reports were to be submitted together with the data files. More details on the cleaning steps are provided in Chapter 11.
Data Submission
Files to be submitted included:
• data for the test booklets and context questionnaires;
• data for the international option instrument(s), if used;
• data for the multiple-marking study;
• the List of Sampled Schools; and
• Student Tracking Forms.

Hard or electronic copies of the last two items were also required.
After Data Were Submitted
NPMs were required to designate a data manager who would work actively with the Consortium's data processing centre at ACER during the international data cleaning process. Responses to requests for information from the processing centre were required within three working days of the request.
Quality Monitoring
Adrian Harvey-Beavis and Nancy Caldwell
It is essential that a high-profile and expensive project such as PISA be undertaken to high standards. This requires not only that procedures be carefully developed, but also that they be carefully monitored to ensure that they are in fact fully and completely implemented. Should procedures not be implemented fully, it is necessary to understand to what extent this was so, and the likely implications for data quality. This is the task of quality monitoring.
Quality monitoring in PISA is, therefore, about observing the extent to which data are collected, retrieved and stored according to the procedures described in the field operations manuals. Quality control is embedded in the field operations procedures. For PISA 2000, Quality Monitors were appointed to do this observing, but the responsibility for quality control resided with the National Project Managers (NPMs), who were to implement the field operations guidelines and thus establish quality control.
A program of national centre and school visits was central to ensuring full, valid implementation of PISA procedures. The main aims of the site visit program were to forestall operational problems and to ensure that the data collected in different countries were comparable and of the highest quality.
There were two levels of quality monitoring in the PISA project:
• National Centre Quality Monitors (NCQMs)—To observe how PISA field operations were being implemented at the national level, Consortium representatives monitored the national centres by visiting NPMs in each country just prior to the field trial and just prior to the main study.
• School Quality Monitors (SQMs)—Employed by the Consortium, but located in participating countries, SQMs visited a sample of schools to record how well the field operations guidelines were followed. They visited a small number of schools for the field trial (typically around five in each country) and around 30 to 40 schools for the main study.
Preparation of Quality Monitoring Procedures
National Centre Quality Monitors (NCQMs) were members of the Consortium. They met prior to the field trial to develop the instruments needed to collect data for reporting purposes. A standardised interview schedule was developed for use in discussions with the NPM. This instrument addressed key aspects of the procedures described in the NPM Manual and other topics relevant to considering data quality, for example, relations between participating schools and the national centre.
A School Quality Monitor Manual was prepared for the field trial and later revised for the main study. This manual outlined the duties of SQMs, including confidentiality requirements. It contained the Data Collection Sheet to be used by the SQM when visiting a school. An interview schedule to be used with the School Co-ordinator was also included. Additionally, for the field trial, a short interview to get student feedback on questionnaires was included.
Implementing Quality Monitoring Procedures
Training National Centre Quality Monitors and School Quality Monitors
The Consortium organised training sessions for the National Centre Quality Monitors (NCQMs), who trained the School Quality Monitors (SQMs).
As part of their training, NCQMs received an overview of the design and purpose of the project, which gave special attention to the responsibilities of NPMs in conducting the study in their country.
The Manual for National Centre Quality Monitors was used for the training session, and NCQMs were trained to use a schedule for interviewing NPMs.
School Quality Monitors were trained to conduct on-site quality monitoring in schools and to prepare a report on the school visits. The School Quality Monitor Manual was used for their training sessions.
National Project Managers were asked to:
• nominate individuals who could fulfil the role of SQM, who were then approved by the Consortium (where nominations were not approved, new nominations were solicited); and
• provide the SQMs with the schedule for testing dates and times and other details such as the name of the school, its location, contact details, and the names of the School Co-ordinator and Test Administrator (if the two were different).
National Centre Quality Monitoring
For both the field trial and the main study, NCQMs visited all national centres, including national centres where a Consortium member was based.1
The visits to the NPMs took place about a month before testing started. This meant that procedures could be reviewed while it was still possible to make minor adjustments. It also meant that the
NCQMs could train SQMs not too long before the testing. National Centre Quality Monitors asked NPMs about (i) implementing procedures at the national centre and (ii) any difficulties suggesting that changes might be needed to the field operations.
Questions on procedures focused on:
• how sampling procedures were implemented;
• communication between national centres and participating schools;
• translating, printing and distribution of materials to schools;
• management procedures for materials at the national centre;
• planning for data preparation, data checking and analysis;
• data collection quality control procedures, particularly documentation of data collection and data entry; and
• data management procedures.

While visiting NPMs, the NCQMs collected
copies of manuals and tracking forms used in each country, which were subsequently reviewed and evaluated.
School Quality Monitoring
National Project Managers (NPMs) were asked to nominate SQMs and advised to consider that persons nominated:
• should be knowledgeable about, or might reasonably be expected to easily learn about, the procedures and materials of PISA;
• must speak fluently the language of the country and either English or French;
• should have an education or assessment background;
• may be Consortium staff or individuals specifically selected to undertake school monitoring;
• need to be approved by the Consortium and work independently of NPMs;
• need to be available to attend a training session; and
• may undertake monitoring in their own country or exchange with another country through a reciprocal agreement.

For the main study, NPMs were also advised that it was preferable for an SQM trained for the field trial also to undertake school visits during the main study in 2000.
The Consortium paid SQMs’ expenses and fees.
1 To ensure independence of the Quality Monitors from the centres they visited, it was decided, for the field trial, that nationals should not visit their own centre. For the main study, this was relaxed, as there was no evidence in the field trial of problems among these national centres.
For school visits, the field trial provided the opportunity to observe and judge:
• the quality of the Test Administrator's Manual;
• the clarity of instructions used by the Test Administrators; and
• whether any of the procedures or documentation used in the main study should be modified.

For the main study, school visits focused explicitly on observing the implementation of procedures, and the impact of the testing environment on the data. During their visits, SQMs:
• reviewed the activities preliminary to the test administration, such as distribution of booklets;
• observed test administration;
• verified the school Test Administrator's record keeping, for example, whether assessed students matched sampled students;
• asked for comments on procedures and materials; and
• completed a report on each school visited, which was returned to the Consortium.

The original plan was to have SQMs visit
schools selected at random, but this procedure proved too costly to implement. In all countries, at least one school was randomly selected, and the remainder were selected with consideration given to proximity to the residence of the SQM, to contain travel and accommodation costs. In many countries, the school selection procedure was to divide the country into regions, with one SQM responsible for each region. One school was randomly selected in each region, and the remainder were also selected with consideration given to proximity to the SQM's residence. In practice, SQMs often still had to travel large distances because many schools had testing on the same day.
The majority of school visits were unannounced. Three countries would not permit this because national policy and other circumstances (especially, for example, remote schools) required making prior arrangements with the schools so they could assist with transportation and accommodation.
Site Visit Data
Reports on the results of the national centre and school site visits were prepared by the Consortium and distributed to national centres after both the field trial and the main study.

For the main study, the national quality monitoring reports were used as part of the data adjudication process (see Chapter 4). An aggregated report on quality monitoring is also included as Appendix 5.
Survey Weighting and the Calculation of Sampling Variance
Keith Rust and Sheila Krawchuk
Survey weights were required to analyse PISA data, to calculate appropriate estimates of sampling error, and to make valid estimates and inferences. The Consortium calculated survey weights for all assessed and excluded students, and provided variables in the data that permit users to make approximately unbiased estimates of standard errors, to conduct significance tests and to create confidence intervals appropriately, given the sample design for PISA in each individual country.
Survey Weighting
Students included in the final PISA sample for a given country are not all equally representative of the entire student population, despite the random sampling of schools and students used to select the sample. Survey weights must therefore be incorporated into the analysis.
There are several reasons why the survey weights are not the same for all students in a given country:
• A school sample design may intentionally over- or under-sample certain sectors of the school population: in the former case, so that they could be effectively analysed separately for national purposes, such as a relatively small but politically important province or region, or a sub-population using a particular language of instruction; and in the latter case, for reasons of cost, or other practical considerations,1 such as very small or geographically remote schools.
• Information about school size available at the time of sampling may not have been completely accurate. If a school was expected to be very large, the selection probability was based on the assumption that only a sample of its students would be selected for PISA. But if the school turned out to be quite small, all students would have to be included and would have, overall, a higher probability of selection in the sample than planned, making these inclusion probabilities higher than those of most other students in the sample. Conversely, if a school thought to be small turned out to be large, the students included in the sample would have had smaller selection probabilities than others.
• School non-response, where no replacement school participated, may have occurred, leading to the under-representation of students from that kind of school, unless weighting adjustments were made. It is also possible that only part of the eligible population in a school (such as those 15-year-olds in a single grade) was represented by its student sample, which also requires weighting to compensate for the missing data from the omitted grades.
• Student non-response, within participating schools, occurred to varying extents. Students of the kind that could not be given achievement test scores (but were not excluded for linguistic or disability reasons) will be under-represented in the data unless weighting adjustments are made.

1 Note that this is not the same as excluding certain portions of the school population. This also happened in some cases, but cannot be addressed adequately through the use of survey weights.
• Trimming weights to prevent undue influence of a relatively small subset of the school or student sample might have been necessary if a small group of students would otherwise have much larger weights than the remaining students in the country. Such weights can lead to unstable estimates—large sampling errors that cannot be well estimated. Trimming weights introduces a small bias into estimates but greatly reduces standard errors.
• Weights need adjustment for analysing mathematics and science data to reflect the fact that not all students were assessed in each subject.

The procedures used to derive the survey weights for PISA reflect the standards of best practice for analysing complex survey data, and the procedures used by the world's major statistical agencies. The same procedures were used in other international studies of educational achievement: the Third International Mathematics and Science Study (TIMSS), the Third International Mathematics and Science Study-Repeat (TIMSS-R), the Civic Education Study (CIVED) and the Progress in International Reading Literacy Study 2001 (PIRLS), all implemented by the International Association for the Evaluation of Educational Achievement (IEA); and the International Assessment of Educational Progress (IAEP, 1991). (See Cochran, 1977, and Särndal, Swensson and Wretman, 1992, for survey sampling texts that present the underlying statistical theory.)
The weight, Wij, for student j in school i consists of two base weights—the school and the within-school—and five adjustment factors, and can be expressed as:

Wij = t2ij f1i f2i fAij t1i w1i w2ij

where:

w1i, the school base weight, is given as the reciprocal of the probability of inclusion of school i in the sample;

w2ij, the within-school base weight, is given as the reciprocal of the probability of selection of student j from within the selected school i;

f1i is an adjustment factor to compensate for non-participation by other schools that are somewhat similar in nature to school i (not already compensated for by the participation of replacement schools);

f2i is an adjustment factor to compensate for the fact that, in some countries, in some schools only 15-year-old students who were enrolled in the modal grade for 15-year-olds were included in the assessment;

fAij is an adjustment factor to compensate for the absence of achievement scale scores from some sampled students within school i (who were not excluded);

t1i is a school trimming factor, used to reduce unexpectedly large values of w1i; and

t2ij is a student trimming factor, used to reduce the weights of students with exceptionally large values for the product of all the preceding weight components.

The School Base Weight

The term w1i is referred to as the school base weight. For the systematic probability-proportional-to-size school sampling method used in PISA, this is given as:

w1i = int(g/i)/mos(i)  if mos(i) < int(g/i);  1 otherwise.  (1)

The term mos(i) denotes the measure of size given to each school on the sampling frame. Despite country variations, mos(i) was usually equal to the (estimated) number of 15-year-olds in the school, if it was greater than the predetermined target cluster size (35 in most countries). If the enrolment of 15-year-olds was less than the Target Cluster Size (TCS), then mos(i) = TCS. In addition, in countries where a stratum of very small schools was used, if the number of 15-year-olds in a school was below TCS/2, then mos(i) = TCS/2.

The term int(g/i) denotes the sampling interval used within the explicit sampling stratum g that contains school i, and is calculated as the total of the mos(i) values for all schools in stratum g, divided by the school sample size for that stratum.

Thus, if school i was estimated to have 100 15-year-olds at the time of sample selection, mos(i) = 100. If the country had a single explicit stratum (g = 1) and the total of the mos(i) values over all schools was 150 000, with a school
sample size of 150, then int(1/i) = 150 000/150 = 1 000 for school i (and the others in the sample), giving w1i = 1 000/100 = 10.0. Roughly speaking, the school can be thought of as representing about 10 schools from the population. In this example, any school with 1 000 or more 15-year-old students would be included in the sample with certainty, with a base weight of w1i = 1.
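The worked example above can be sketched in code. This is a minimal illustration of the definitions in the text, assuming TCS = 35 and the example frame totals; the function names are invented:

```python
# Illustrative sketch of the school base weight w1i, following the
# definitions in the text: mos(i) rules for small schools, and
# w1i = int(g/i)/mos(i) unless the school is a certainty selection.

def mos(enrolment_15, tcs=35, small_school_stratum=False):
    """Measure of size assigned to a school on the sampling frame."""
    if small_school_stratum and enrolment_15 < tcs / 2:
        return tcs / 2
    if enrolment_15 < tcs:
        return tcs
    return enrolment_15

def school_base_weight(mos_i, sampling_interval):
    """w1i = int(g/i)/mos(i), or 1 for certainty selections."""
    return sampling_interval / mos_i if mos_i < sampling_interval else 1.0

interval = 150_000 / 150   # total mos over the stratum / school sample size

print(school_base_weight(mos(100), interval))    # 10.0, the worked example
print(school_base_weight(mos(1_200), interval))  # 1.0, a certainty school
```

A school with mos(i) = 100 against a sampling interval of 1 000 represents about 10 schools from the population, matching the example in the text.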
The School Weight Trimming Factor
Once school base weights were established for each sampled school in the country, verifications were made separately within each explicit sampling stratum to see if the school weights required trimming. The school trimming factor, t1i, is the ratio of the trimmed to the untrimmed school base weight; it is equal to 1.0000 for most schools and therefore most students, and never exceeds this value. (See Table 23 for the number of school records in each country that received some kind of base weight trimming.)
The first check was for schools that had been assigned very small selection probabilities because of very small enrolments, although the school sampling procedure described in the Sampling Manual and in Chapter 4 of this report was designed to prevent this. Cases did arise where schools had very small probabilities of selection relative to others in the same stratum because they were assigned a value of mos(i) that was much smaller than the TCS, usually because the actual enrolment was used as the value of mos(i), even when it was much smaller than the TCS. Where the value of TCS was 35, as in most jurisdictions, if a school with one enrolled 15-year-old student was given a mos(i) of 1 rather than 17.5 (or 35, depending on whether or not a small school stratum was used), the school base weight was very large. The sampled students in that school would have received a weight 35 times greater than that of a student in the sample from a school with 35 15-year-old students.
These schools were given a compromise weight, neither excessively large nor introducing a bias by under-representing students from very small schools (which often constituted a sizeable fraction of the PISA student population). This was done by essentially replacing the inappropriate mos(i) with TCS/6, giving them and their students weights six times as great as those of most students in the country and three times as great as they would have had if the small school sampling option had been implemented correctly. Permitting these schools to have weights three times as great as they would have had, had they been sampled correctly, was also consistent with the other trimming procedures, described below.
The second school-level trimming adjustment was applied to schools that turned out to be much larger than was believed at the time of sampling—where 15-year-old enrolment exceeded 3 × max(TCS, mos(i)). For example, if TCS = 35, then a school flagged for trimming had more than 105 PISA-eligible students, and more than three times as many students as was indicated on the school sampling frame. Because the student sample size was set at TCS regardless of the actual enrolment, the student sampling rate was much lower than anticipated during the school sampling. This meant that the weights for the sampled students in these schools would have been more than three times greater than anticipated when the school sample was selected. These schools had their school base weights trimmed by having mos(i) replaced by 3 × max(TCS, mos(i)) in the school base weight formula.
The Student Base Weight
The term is referred to as the student baseweight. With the PISA procedure for samplingstudents, did not vary across students (j)within a particular school i. Thus is given as:
, (2)
where is the actual enrolment of 15-year-olds in the school (and so, in general, issomewhat different from the estimated ),and is the sample size within school i. Itfollows that if all students from the school wereselected, then for all eligible students inthe school. For all other cases .
School Non-response Adjustment
In order to adjust for the fact that those schools that declined to participate, and were not replaced by a replacement school, were not in general typical of the schools in the sample as a whole, school-level non-response adjustments were made. Several groups of somewhat similar schools were formed within a country, and within
each group the weights of the responding schools were adjusted to compensate for the missing schools and their students. The compositions of the non-response groups varied from country to country, but were based on cross-classifying the explicit and implicit stratification variables used at the time of school sample selection. Usually, about 10 to 15 such groups were formed within a given country, depending upon the school distribution with respect to the stratification variables. If a country provided no implicit stratification variables, schools were divided into three roughly equal groups, within each stratum, based on their size (small, medium or large).

It was desirable to ensure that each group had at least six participating schools, as small groups can lead to unstable weight adjustments, which in turn would inflate the sampling variances. However, it was not necessary to collapse cells where all schools participated, as the school non-response adjustment factor was 1.0 regardless of whether cells were collapsed or not. Adjustments greater than 2.0 were flagged for review, as they can cause increased variability in the weights and lead to an increase in sampling variances. In either of these situations, cells were generally collapsed over the last implicit stratification variable(s) until the violations no longer existed. In countries with very high overall levels of school non-response after school replacement, the requirement that all school non-response adjustment factors be below 2.0 was waived.
Within the school non-response adjustment group containing school i, the non-response adjustment factor was calculated as:
$$f_{1i} = \frac{\sum_{k \in \Omega(i)} w_{1k}\,\mathrm{enr}(k)}{\sum_{k \in \Gamma(i)} w_{1k}\,\mathrm{enr}(k)} , \qquad (3)$$
where the sum in the denominator is over Γ(i), the schools within the group (originals and replacements) that participated, while the sum in the numerator is over Ω(i), those same schools plus the original sample schools that refused and were not replaced. The numerator estimates the population of 15-year-olds in the group, while the denominator gives the size of the population of 15-year-olds directly represented by participating schools. The school non-response adjustment factor ensures that participating schools are weighted to represent all students in the group. If a school did not participate because it had no eligible students enrolled, no adjustment was necessary, since this was neither non-response nor under-coverage.
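A minimal sketch of equation (3), under the assumption that each school in an adjustment group is represented by a (w1, enr, participated) record, with unreplaced refusals as the non-participating records:

```python
# Sketch of equation (3). Each record is (school base weight w1, enrolment
# enr, whether the school - original or its replacement - participated).

def school_nonresponse_factor(group):
    # Numerator: Omega(i) = participants plus unreplaced refusals.
    num = sum(w1 * enr for (w1, enr, _participated) in group)
    # Denominator: Gamma(i) = participating schools only.
    den = sum(w1 * enr for (w1, enr, participated) in group if participated)
    return num / den

# Three equal-sized schools with one refusing without replacement:
# factor = 3000 / 2000 = 1.5.
```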
Table 22 shows the number of school non-response classes that were formed for each country, and the variables that were used to create the cells.
Table 22: Non-Response Classes

| Country | Variables Used to Create Non-Response Classes | Original Number of Cells | Final Number of Cells After Collapsing Small Cells |
| --- | --- | --- | --- |
| Australia | Metropolitan/Country | 42 | 24 |
| Austria | No school non-response adjustments | | |
| Belgium (Fl.) | For strata 1-3, School Type and Organisation | 27 | 8 |
| Belgium (Fr.) | 5 categories of the school proportion of overage students | 14 | 4 |
| Brazil | Public/Private, Region, Score Range | 104 | 50 |
| Canada | Public/Private, Urban/Rural | 134 | 64 |
| Czech Republic | 3 School Sizes | 65 | 65 |
| Denmark | First 2 digits of concatenated School Type and County | 78 | 24 |
Table 22 (cont.)

| Country | Variables Used to Create Non-Response Classes | Original Number of Cells | Final Number of Cells After Collapsing Small Cells |
| --- | --- | --- | --- |
| England | For very small schools: School Type, Region; for other schools: School Type, Exam Results (LEA or Grant Maintained schools) or Co-educational Status (Independent Schools), Region | 50 | 17 |
| Finland | No school non-response adjustments | | |
| France | 3 School Sizes | 18 | 10 |
| Germany | For strata 66 and 67, State; for all others, the explicit stratum | 60 | 29 |
| Greece | School Type | 30 | 14 |
| Hungary | Combinations of School Type, Region, and Location (75 values) | 100 | 31 |
| Iceland | 3 School Sizes | 21 | 14 |
| Ireland | School Type, Gender Composition | 16 | 9 |
| Italy | No school non-response adjustments | | |
| Japan | 3 School Sizes | 12 | 10 |
| Korea | No school non-response adjustments | | |
| Latvia | Urbanisation, School Type | 29 | 10 |
| Liechtenstein | No school non-response adjustments | | |
| Luxembourg | 3 School Sizes | 6 | 4 |
| Mexico | No school non-response adjustments | | |
| Netherlands | 3 School Sizes | 15 | 7 |
| New Zealand | Public/Private, School Gender Composition, Socio-Economic Status, Urban/Rural | 15 | 9 |
| Northern Ireland | School Type, Examination Results, Region | 30 | 14 |
| Norway | 3 School Sizes | 15 | 9 |
| Poland | 3 School Sizes | 9 | 9 |
| Portugal | First 2 digits of Public/Private, 3 digits of Region, and the Index of Social Development for 3 regions | 22 | 11 |
| Russian Federation | School Program, Urbanisation | 47 | 46 |
| Scotland | For stratum 1, Average School Achievement and the first 2 digits of the National ID; for stratum 2, the first 2 digits of the National ID | 49 | 11 |
| Spain | No school non-response adjustments | | |
| Sweden | Combinations of the implicit stratification variables | 15 | 11 |
| Switzerland | Pseudo Sampling Strata and 3 School Sizes | 49 | 37 |
| United States | Public/Private, High/Low Minority and Private School Type, PSU | 101 | 18 |
Grade Non-response Adjustment
In a few countries, several schools agreed to participate in PISA but required that participation be restricted to 15-year-olds in the modal grade for 15-year-olds, rather than all 15-year-olds, because of perceived administrative inconvenience. Since the modal grade generally included the majority of the population to be covered, some of these schools were accepted as participants. For the part of the 15-year-old population in the modal grade, these schools were respondents, while for the remaining grades containing 15-year-olds, they were refusals. This situation occasionally arose for a grade other than the modal grade for other reasons, such as other testing being carried out in certain grades at the same time as the PISA assessment. To account for this, a special non-response adjustment was calculated at the school level for students not in the modal grade (and was automatically 1.0 for all students in the modal grade).
Within the same non-response adjustment groups used for creating school non-response adjustment factors, the grade non-response adjustment factor for students in school i, f^A_1ij, is given as:
$$f^{A}_{1ij} = \begin{cases} \dfrac{\sum_{k \in C(i)} w_{1k}\,\mathrm{enra}(k)}{\sum_{k \in B(i)} w_{1k}\,\mathrm{enra}(k)} & \text{for students not in the modal grade} \\[2ex] 1 & \text{otherwise.} \end{cases} \qquad (4)$$
The variable enra(k) is the approximate number of 15-year-old students in school k who are not in the modal grade. The set B(i) consists of all schools that participated for all eligible grades (from within the non-response adjustment group containing school i), while the set C(i) includes these schools plus those that participated only for the modal grade.
This procedure gave a single grade non-response adjustment factor for each school, which depended upon its non-response adjustment class. Each individual student received this factor value if they did not belong to the modal grade, and 1.0000 if they belonged to the modal grade. In general, therefore, this factor is not the same for all students within the same school.
Student Non-response Adjustment
Within each participating school, the student non-response adjustment was calculated as:
$$f_{2i} = \frac{\sum_{k \in \Delta(i)} f_{1i}\,w_{1i}\,w_{2ik}}{\sum_{k \in X(i)} f_{1i}\,w_{1i}\,w_{2ik}} , \qquad (5)$$
where the set X(i) is all assessed students in the school, and the set Δ(i) is all assessed students in the school plus all others who should have been assessed (i.e., students who were absent, but not those who were excluded or ineligible).
In most cases, this student non-response factor reduces to the ratio of the number of students who should have been assessed to the number who were assessed. In some cases involving small schools (fewer than 15 respondents), it was necessary to collapse schools together, and then the more complex formula above applied.
Additionally, an adjustment factor greater than 2.0 was not allowed, for the same reasons noted under school non-response adjustments. If this occurred, the school with the large adjustment was collapsed with the next school in the same school non-response cell.
Some schools in some countries had very low student response levels. In these cases it was determined that the small sample of assessed students was potentially too biased a representation of the school to be included in the PISA data. For any school where the student response rate was below 25 per cent, the school was therefore treated as a non-respondent, and its student data were removed. In schools with between 25 and 50 per cent student response, the student non-response adjustment described above would have resulted in an adjustment factor of between 2.0000 and 4.0000, so these schools were collapsed with others to create student non-response adjustments.
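The rules in this subsection can be summarised in a sketch; the simple-ratio form of the factor and the 25 and 50 per cent thresholds are from the text, while the function names are ours:

```python
def student_nonresponse_factor(assessed: int, should_assess: int) -> float:
    """Simple case of equation (5): no collapsing across schools needed."""
    return should_assess / assessed

def student_response_treatment(assessed: int, should_assess: int) -> str:
    """Classify a school by its student response rate."""
    rate = assessed / should_assess
    if rate < 0.25:
        return "treat as school non-respondent; remove student data"
    if rate < 0.50:
        return "collapse with other schools for the adjustment"
    return "adjust within school"
```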
(Chapter 12 describes these schools as being treated as non-respondents for the purpose of response rate calculation, even though their student data were used in the analyses.)
Trimming Student Weights
This final trimming check was used to detect student records with weights that were unusually large compared to those of other students within the same explicit stratum. The sample design was intended to give all students from within the same explicit stratum an equal probability of selection, and therefore equal weight, in the absence of school and student non-response. As already noted, inappropriate school sampling procedures and poor prior information about the number of eligible students in each school could lead to substantial violations of this principle. Moreover, school, grade and student non-response adjustments and, occasionally, inappropriate student sampling could, in a few cases, accumulate to give a few students in the data relatively very large weights, which adds considerably to sampling variance. The weights of individual students were therefore reviewed, and where a weight was more than four times the median weight of students from the same explicit sampling stratum, it was trimmed to be equal to four times the median weight for that explicit stratum.
The student trimming factor t_2ij is equal to the ratio of the final student weight to the student weight adjusted for student non-response, and is therefore equal to 1.0000 for the great majority of students. The final weight variable on the data file was called w_fstuwt; it is the final student weight incorporating any student-level trimming. Table 23 shows, for each country, the number of students with weights trimmed at this point in the process (i.e., t_2ij < 1.0000) and the number of schools for which the school base weight was trimmed (i.e., t_1i < 1.0000).
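The trimming rule (capping at four times the within-stratum median) can be sketched as follows; this is an illustration, not the Consortium's weighting program:

```python
import statistics

def trim_student_weights(weights, factor=4.0):
    """Trim the weights within one explicit stratum to factor x median."""
    cap = factor * statistics.median(weights)
    return [min(w, cap) for w in weights]

# A stratum with weights [1, 1, 1, 1, 10] has median 1, so the outlying
# weight is trimmed to 4, and its trimming factor t_2ij is 4/10 < 1.0000.
```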
Table 23: School and Student Trimming

| Country | Schools Trimmed | Students Trimmed |
| --- | --- | --- |
| Australia | 1 | 0 |
| Austria | 45 | 0 |
| Belgium (Fl.) | 0 | 0 |
| Belgium (Fr.) | 0 | 0 |
| Brazil | 2 | 0 |
| Canada | 2 | 0 |
| Czech Republic | 1 | 0 |
| Denmark | 1 | 0 |
| England | 0 | 0 |
| Finland | 0 | 0 |
| France | 0 | 0 |
| Germany | 0 | 0 |
| Greece | 1 | 0 |
| Hungary | 2 | 0 |
| Iceland | 0 | 0 |
| Ireland | 0 | 0 |
| Italy | 1 | 0 |
| Japan | 0 | 37 |
| Korea | 1 | 4 |
| Latvia | 1 | 0 |
| Liechtenstein | 0 | 0 |
| Luxembourg | 0 | 0 |
| Mexico | 1 | 23 |
| Netherlands | 0 | 0 |
| New Zealand | 0 | 0 |
| Northern Ireland | 0 | 0 |
| Norway | 0 | 0 |
| Poland | 1 | 0 |
| Portugal | 0 | 0 |
| Russian Federation | 3 | 39 |
| Scotland | 0 | 0 |
| Spain | 5 | 11 |
| Sweden | 0 | 0 |
| Switzerland | 2 | 0 |
| United States | 0 | 0 |
Subject-specific Factors for Mathematics and Science

The weights described above are appropriate for analysing data collected from all assessed students, i.e., questionnaire and reading achievement data. Because a special booklet (SE, or Booklet 0) was used in some countries for certain kinds of students, and not at random, additional weighting factors are required to analyse data obtained from only a subset of the ten PISA test booklets, particularly for analysing mathematics and science scale scores. These additional weighting factors were calculated and included with the data.
The mathematics weight factor was given as: 1.0 for each student assigned Booklet 0; 1.8 for each student assigned Booklet 1, 3, 5, 8 or 9; and 0.0 for each student assigned Booklet 2, 4, 6 or 7, which contained no mathematics items.
The science weight factor was given as: 1.0 for each student assigned Booklet 0; 1.8 for each student assigned Booklet 2, 4, 6, 8 or 9; and 0.0 for each student assigned Booklet 1, 3, 5 or 7, which contained no science items.
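The two booklet-to-factor mappings stated above can be written directly as lookup tables (Booklet 0 denotes the special SE booklet):

```python
# Subject-specific weight factors by assigned booklet, as given in the text.
MATH_WEIGHT_FACTOR = {0: 1.0, 1: 1.8, 3: 1.8, 5: 1.8, 8: 1.8, 9: 1.8,
                      2: 0.0, 4: 0.0, 6: 0.0, 7: 0.0}

SCIENCE_WEIGHT_FACTOR = {0: 1.0, 2: 1.8, 4: 1.8, 6: 1.8, 8: 1.8, 9: 1.8,
                         1: 0.0, 3: 0.0, 5: 0.0, 7: 0.0}
```

A student's mathematics (or science) analysis weight is then the final student weight multiplied by the factor for the booklet assigned to that student.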
Calculating Sampling Variance
To estimate the sampling variances of PISA estimates, a replication methodology was employed. This reflected the variance in estimates due to the sampling of schools and students. Additional variance due to the use of plausible values from the posterior distributions of scaled scores was captured separately, although computationally the two components can be estimated in a single program, such as WesVar 4 (Westat, 2000).
The Balanced Repeated Replication Variance Estimator
The approach used for calculating sampling variances for PISA is known as Balanced Repeated Replication (BRR), or Balanced Half-Samples; the particular variant known as Fay's method was used. This method is very similar in nature to the jackknife method used in previous international studies of educational achievement, such as TIMSS. It is well documented in the survey sampling literature (see Rust, 1985; Rust and Rao, 1996; Shao, 1996; Wolter, 1985). The
major advantage of BRR over the jackknife is that the jackknife method is not fully appropriate for use with non-differentiable functions of the survey data, most noticeably quantiles: it provides unbiased but not consistent estimates of variance. This means that, depending upon the sample design, the variance estimator can be very unstable, and despite empirical evidence that it can behave well in a PISA-like design, theory is lacking. In contrast, BRR does not have this theoretical flaw. The standard BRR procedure can become unstable when used to analyse sparse population subgroups, but Fay's modification overcomes this difficulty and is well justified in the literature (Judkins, 1990).
The BRR approach was implemented as follows, for a country where the student sample was selected from a sample of schools rather than all schools:

• Schools were paired on the basis of the explicit and implicit stratification and frame ordering used in sampling. The pairs were original sampled schools, or pairs that included a participating replacement if an original refused. For an odd number of schools within a stratum, a triple was formed, consisting of the last school and the pair preceding it.
• Pairs were numbered sequentially, 1 to H, with pair number denoted by the subscript h. Other studies and the literature refer to such pairs as variance strata, zones, or pseudo-strata.
• Within each variance stratum, one school (the Primary Sampling Unit, PSU) was randomly numbered as 1, the other as 2 (and the third as 3, in a triple), which defined the variance unit of the school. The subscript j refers to this numbering.
• These variance strata and variance units (1, 2, 3) assigned at the school level were attached to the data for the sampled students within the corresponding school.
• Let the estimate of a given statistic from the full student sample be denoted as X*. This is calculated using the full sample weights.
• A set of 80 replicate estimates, X*_t (where t runs from 1 to 80), was created. Each of these replicate estimates was formed by multiplying the sampling weights from one of the two primary sampling units (PSUs) in each stratum by 1.5, and the weights from the remaining
PSUs by 0.5. The determination as to which PSUs received inflated weights, and which received deflated weights, was carried out in a systematic fashion, based on the entries in a Hadamard matrix of order 80. A Hadamard matrix contains entries that are +1 and −1 in value, and has the property that the matrix, multiplied by its transpose, gives the identity matrix of order 80, multiplied by a factor of 80. Examples of Hadamard matrices are given in Wolter (1985).
• In cases where there were three units in a triple, either one of the schools (designated at random) received a factor of 1.7071 for a given replicate, with the other two schools receiving factors of 0.6464, or else the one school received a factor of 0.2929 and the other two schools received factors of 1.3536. The explanation of how these particular factors came to be used is given in Appendix 12.
• To use a Hadamard matrix of order 80 requires that there be no more than 80 variance strata within a country, or else that some combining of variance strata be carried out prior to assigning the replication factors via the Hadamard matrix. The combining of variance strata does not cause any bias in variance estimation, provided that it is carried out in such a way that the assignment of variance units is independent from one stratum to another within strata that are combined. That is, the assignment of variance units must be completed before the combining of variance strata takes place; this approach was used for PISA.
• The reliability of variance estimates for important population subgroups is enhanced if any combining of variance strata that is required is conducted by combining variance strata from different subgroups. Thus in PISA, variance strata that were combined were selected from different explicit sampling strata and, to the extent possible, from different implicit sampling strata also.
• In some countries, the entire sample was not a two-stage design of first sampling schools and then sampling students. In some countries, for part of the sample (and for the entire samples of Iceland, Liechtenstein and Luxembourg), schools were included with certainty in the sampling, so that only a single stage of student sampling was carried out for this part of the sample. In these cases, instead of pairing schools, pairs of individual students were formed from within the same school (and, if the school had an odd number of sampled students, a triple of students was also formed). The procedure of assigning variance units and replicate weight factors was then conducted at the student level, rather than at the school level.
• In contrast, in a few countries there was a stage of sampling that preceded the selection of schools, for at least part of the sample. This was done in a major way in the Russian Federation and the United States, and in a more minor way in Germany and Poland. In these cases, the procedure for assigning variance strata, variance units and replicate factors was applied at this higher level of sampling. The schools and students then inherited the assignment from the higher-level unit in which they were located.
• The variance estimator is then:

$$V_{BRR}\left( X^{*} \right) = 0.05 \sum_{t=1}^{80} \left( X_t^{*} - X^{*} \right)^2 . \qquad (6)$$
The properties of BRR have been established by demonstrating that it is unbiased and consistent for simple linear estimators (i.e., means from straightforward sample designs), by showing that it has desirable asymptotic consistency for a wide variety of estimators under complex designs, and through empirical simulation studies.
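Given the full sample estimate and the 80 replicate estimates formed with the 1.5/0.5 factors, equation (6) can be computed as below. The general scale 1/(G(1 − k)²) reduces to 0.05 for G = 80 replicates and Fay's k = 0.5; the function is our sketch, not WesVar code:

```python
def fay_brr_variance(full_estimate, replicate_estimates, fay_k=0.5):
    """Equation (6): V_BRR = (1 / (G * (1 - k)**2)) * sum_t (X*_t - X*)**2,
    which equals 0.05 * sum of squared deviations when G = 80 and k = 0.5."""
    g = len(replicate_estimates)
    scale = 1.0 / (g * (1.0 - fay_k) ** 2)
    return scale * sum((xt - full_estimate) ** 2 for xt in replicate_estimates)
```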
Reflecting Weighting Adjustments
This description glosses over one aspect of the implementation of the BRR method. Weights for a given replicate are obtained by applying the adjustment to the weight components that reflect selection probabilities (the school base weight in most cases), and then re-computing the non-response adjustments replicate by replicate.
Implementing this approach required that the Consortium produce a set of replicate weights in addition to the full sample weight. Eighty such replicate weights were needed for each student in the data file. The school and student non-response adjustments had to be repeated for each set of replicate weights.
To estimate sampling errors correctly, the analyst must use the variance estimation formula above, deriving the estimates X*_t using the t-th set of replicate weights instead of the full sample weight. Because of the weight adjustments (and the presence of occasional triples), this does not mean merely increasing the final full sample weights for half the schools by a factor of 1.5 and decreasing the weights for the remaining schools by a factor of 0.5. Many replicate weights will also be slightly disturbed, beyond these adjustments, as a result of repeating the non-response adjustments separately by replicate.
Formation of Variance Strata
With the approach described above, all originally sampled schools were sorted in stratum order (including refusals and ineligibles) and paired, in contrast to other international education assessments, such as TIMSS and TIMSS-R, which paired participating schools only. However, those studies did not use an approach reflecting the impact of non-response adjustments on sampling variance. This is unlikely to be a big component of variance in any PISA country, but the procedure gives a more accurate estimate of sampling variance.
Countries Where All Students Were Selected for PISA
In Iceland, Liechtenstein and Luxembourg, all eligible students were selected for PISA. It might be considered surprising that the PISA data should reflect any sampling variance in these countries, but students were assigned to variance strata and variance units, and the BRR formula does give a positive estimate of sampling variance, for three reasons. First, in each country there was some student non-response and, in the case of Luxembourg, some school non-response; because not all eligible students were assessed, there is sampling variance. Second, only 55 per cent of the students were assessed in mathematics and science. Third, the aim is to make inferences about educational systems, not about particular groups of individual students, so it is appropriate that a part of the sampling variance reflect random variation between student populations, even if they were to be subjected to identical educational experiences. This is consistent with the approach that is generally used whenever survey data are used to make direct or indirect inferences about some underlying system.
Chapter 9

Scaling PISA Cognitive Data
Ray Adams
The mixed coefficients multinomial logit model, as described by Adams, Wilson and Wang (1997), was used to scale the PISA data, and was implemented with the ConQuest software (Wu, Adams and Wilson, 1997).
The Mixed Coefficients Multinomial Logit Model
The model applied to PISA is a generalised form of the Rasch model. It is a mixed coefficients model in which items are described by a fixed set of unknown parameters, ξ, while the student outcome level (the latent variable), θ, is a random effect.
Assume that $I$ items are indexed $i = 1, \ldots, I$, with each item admitting $K_i + 1$ response categories indexed $k = 0, 1, \ldots, K_i$. Use the vector-valued random variable $\mathbf{X}_i = \left( X_{i1}, X_{i2}, \ldots, X_{iK_i} \right)^T$, where

$$X_{ij} = \begin{cases} 1 & \text{if the response to item } i \text{ is in category } j \\ 0 & \text{otherwise,} \end{cases} \qquad (7)$$

to indicate the $K_i + 1$ possible responses to item $i$.

A vector of zeroes denotes a response in category zero, making the zero category a reference category, which is necessary for model identification. Using this as the reference category is arbitrary, and does not affect the generality of the model. The $\mathbf{X}_i$ can also be collected together into the single vector $\mathbf{X}^T = \left( \mathbf{X}_1^T, \mathbf{X}_2^T, \ldots, \mathbf{X}_I^T \right)$, called the response vector (or pattern). Particular instances of each of these random variables are indicated by their lower-case equivalents: $\mathbf{x}$, $\mathbf{x}_i$ and $x_{ik}$.
Items are described through a vector $\boldsymbol{\xi}^T = \left( \xi_1, \xi_2, \ldots, \xi_p \right)$ of $p$ parameters. Linear combinations of these are used in the response probability model to describe the empirical characteristics of the response categories of each item. Design vectors $\mathbf{a}_{ij}$ $\left( i = 1, \ldots, I;\; j = 1, \ldots, K_i \right)$, each of length $p$, which can be collected to form a design matrix $\mathbf{A}^T = \left( \mathbf{a}_{11}, \mathbf{a}_{12}, \ldots, \mathbf{a}_{1K_1}, \mathbf{a}_{21}, \ldots, \mathbf{a}_{2K_2}, \ldots, \mathbf{a}_{IK_I} \right)$, define these linear combinations.

The multi-dimensional form of the model assumes that a set of $D$ traits underlies the individuals' responses. The $D$ latent traits define a $D$-dimensional latent space. The vector $\boldsymbol{\theta} = \left( \theta_1, \theta_2, \ldots, \theta_D \right)^T$ represents an individual's position in the $D$-dimensional latent space.

The model also introduces a scoring function that allows the specification of the score, or performance level, assigned to each possible response category of each item. To do so, the notion of a response score $b_{ijd}$ is introduced, which gives the performance level of an observed response in category $j$ of item $i$ on dimension $d$. The scores across the $D$ dimensions can be collected into a column vector $\mathbf{b}_{ik} = \left( b_{ik1}, b_{ik2}, \ldots, b_{ikD} \right)^T$, again collected into the scoring sub-matrix for item $i$, $\mathbf{B}_i = \left( \mathbf{b}_{i1}, \mathbf{b}_{i2}, \ldots, \mathbf{b}_{iK_i} \right)^T$, and then into a scoring matrix $\mathbf{B} = \left( \mathbf{B}_1^T, \mathbf{B}_2^T, \ldots, \mathbf{B}_I^T \right)^T$ for the entire test. (The score for a response in the zero category is zero, but other responses may also be scored zero.)
The probability of a response in category $j$ of item $i$ is modelled as:

$$\Pr\left( X_{ij} = 1; \mathbf{A}, \mathbf{B}, \boldsymbol{\xi} \mid \boldsymbol{\theta} \right) = \frac{\exp\left( \mathbf{b}_{ij} \boldsymbol{\theta} + \mathbf{a}_{ij}' \boldsymbol{\xi} \right)}{\sum_{k=1}^{K_i} \exp\left( \mathbf{b}_{ik} \boldsymbol{\theta} + \mathbf{a}_{ik}' \boldsymbol{\xi} \right)} . \qquad (8)$$

For a response vector:

$$f\left( \mathbf{x}; \boldsymbol{\xi} \mid \boldsymbol{\theta} \right) = \Psi\left( \boldsymbol{\theta}, \boldsymbol{\xi} \right) \exp\left[ \mathbf{x}' \left( \mathbf{B} \boldsymbol{\theta} + \mathbf{A} \boldsymbol{\xi} \right) \right] , \qquad (9)$$

with

$$\Psi\left( \boldsymbol{\theta}, \boldsymbol{\xi} \right) = \left\{ \sum_{\mathbf{z} \in \Omega} \exp\left[ \mathbf{z}' \left( \mathbf{B} \boldsymbol{\theta} + \mathbf{A} \boldsymbol{\xi} \right) \right] \right\}^{-1} , \qquad (10)$$

where Ω is the set of all possible response vectors.
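For a single item in the uni-dimensional case, equation (8) is an ordinary softmax over category logits $b_{ik}\theta + \mathbf{a}_{ik}'\boldsymbol{\xi}$, with the zero reference category carried explicitly (its logit is zero). A small sketch in our own notation, with the linear combinations $\mathbf{a}_{ik}'\boldsymbol{\xi}$ passed in pre-computed:

```python
import math

def category_probabilities(theta, b, a_xi):
    """Return Pr(category k | theta) for k = 0..K_i of one item.
    b[k] is the category score, a_xi[k] the value of a_ik' xi;
    category 0 is the reference with b[0] = 0 and a_xi[0] = 0."""
    logits = [bk * theta + ak for bk, ak in zip(b, a_xi)]
    m = max(logits)                       # subtract the max for stability
    expd = [math.exp(z - m) for z in logits]
    total = sum(expd)
    return [e / total for e in expd]

# With b = [0, 1] and a_xi = [0, -delta] this reduces to the simple Rasch
# model: Pr(correct) = exp(theta - delta) / (1 + exp(theta - delta)).
```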
The Population Model
The item response model is a conditional model, in the sense that it describes the process of generating item responses conditional on the latent variable, θ. The complete definition of the model therefore requires the specification of a density, $f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}; \boldsymbol{\alpha} \right)$, for the latent variable, θ. Let α symbolise a set of parameters that characterise the distribution of θ. The most common practice, when specifying uni-dimensional marginal item response models, is to assume that students have been sampled from a normal population with mean µ and variance σ². That is:
$$f_{\theta}\left( \theta; \alpha \right) \equiv f_{\theta}\left( \theta; \mu, \sigma^2 \right) = \left( 2\pi\sigma^2 \right)^{-1/2} \exp\left[ -\frac{\left( \theta - \mu \right)^2}{2\sigma^2} \right] , \qquad (11)$$
or equivalently:

$$\theta = \mu + E , \qquad (12)$$

where $E \sim N\left( 0, \sigma^2 \right)$.
Adams, Wilson and Wu (1997) discuss how a natural extension of (11) is to replace the mean, µ, with the regression model $\mathbf{Y}_n^T \boldsymbol{\beta}$, where $\mathbf{Y}_n$ is a vector of $u$ fixed and known values for student $n$, and β is the corresponding vector of regression coefficients. For example, $\mathbf{Y}_n$ could be constituted of student variables such as gender or socio-economic status. Then the population model for student $n$ becomes:
$$\theta_n = \mathbf{Y}_n^T \boldsymbol{\beta} + E_n , \qquad (13)$$
where it is assumed that the $E_n$ are independently and identically normally distributed with mean zero and variance σ², so that (13) is equivalent to:
$$f_{\theta}\left( \theta_n; \mathbf{Y}_n, \boldsymbol{\beta}, \sigma^2 \right) = \left( 2\pi\sigma^2 \right)^{-1/2} \exp\left[ -\frac{\left( \theta_n - \mathbf{Y}_n^T \boldsymbol{\beta} \right)^2}{2\sigma^2} \right] , \qquad (14)$$
a normal distribution with mean $\mathbf{Y}_n^T \boldsymbol{\beta}$ and variance σ². If (14) is used as the population model, then the parameters to be estimated are β, σ² and ξ.
The generalisation needs to be taken one step further to apply it to the vector-valued $\boldsymbol{\theta}$ rather than the scalar-valued θ. The extension results in the multivariate population model:
$$f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}_n; \mathbf{W}_n, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right) = \left( 2\pi \right)^{-d/2} \left| \boldsymbol{\Sigma} \right|^{-1/2} \exp\left[ -\frac{1}{2} \left( \boldsymbol{\theta}_n - \boldsymbol{\gamma} \mathbf{W}_n \right)^T \boldsymbol{\Sigma}^{-1} \left( \boldsymbol{\theta}_n - \boldsymbol{\gamma} \mathbf{W}_n \right) \right] , \qquad (15)$$
where γ is a u×d matrix of regression coefficients, Σ is a d×d variance-covariance matrix, and $\mathbf{W}_n$ is a u×1 vector of fixed variables.
In PISA, the Wn variables are referred to as conditioning variables.
Combined Model
In (16), the conditional item response model (9) and the population model (15) are combined to obtain the unconditional, or marginal, item response model:
$$f_{\mathbf{x}}\left( \mathbf{x}; \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right) = \int_{\boldsymbol{\theta}} f_{\mathbf{x}}\left( \mathbf{x}; \boldsymbol{\xi} \mid \boldsymbol{\theta} \right) f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}; \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right) d\boldsymbol{\theta} . \qquad (16)$$
It is important to recognise that under this model the locations of individuals on the latent variables are not estimated. The parameters of the model are γ, Σ and ξ.
The procedures used to estimate the model parameters are described in Adams, Wilson and Wu (1997), Adams, Wilson and Wang (1997), and Wu, Adams and Wilson (1997).
For each individual it is, however, possible to specify a posterior distribution for the latent variable, given by:
$$h_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}_n; \mathbf{W}_n, \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \mid \mathbf{x}_n \right) = \frac{f_{\mathbf{x}}\left( \mathbf{x}_n; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n \right) f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}_n; \mathbf{W}_n, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right)}{f_{\mathbf{x}}\left( \mathbf{x}_n; \mathbf{W}_n, \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right)} = \frac{f_{\mathbf{x}}\left( \mathbf{x}_n; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n \right) f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}_n; \mathbf{W}_n, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right)}{\int_{\boldsymbol{\theta}} f_{\mathbf{x}}\left( \mathbf{x}_n; \boldsymbol{\xi} \mid \boldsymbol{\theta} \right) f_{\boldsymbol{\theta}}\left( \boldsymbol{\theta}; \mathbf{W}_n, \boldsymbol{\gamma}, \boldsymbol{\Sigma} \right) d\boldsymbol{\theta}} . \qquad (17)$$
Application to PISA
In PISA, this model was used in three steps: national calibrations, international scaling, and student score generation.
For both the national calibrations and the international scaling, the conditional item response model (9) is used in conjunction with the population model (15), but conditioning variables are not used. That is, it is assumed that students have been sampled from a multivariate normal distribution.
The PISA model is five-dimensional, made up of three reading dimensions, one science dimension and one mathematics dimension. The design matrix was chosen so that the partial credit model was used for items with multiple score categories, and the simple logistic model was fitted to the dichotomously scored items.
National Calibrations
National calibrations were performed separately, country by country, using unweighted data. The results of these analyses, which were used to monitor the quality of the data and to make decisions regarding national item treatment, are given in Chapter 13.
The outcomes of the national calibrations were used to make a decision about how to treat each item in each country. This means that: an item may be deleted from PISA altogether if it has poor psychometric characteristics in more than eight countries (a dodgy item); it may be regarded as not administered in particular countries if it has poor psychometric characteristics in those countries but functions well in the vast majority of others; or an item with sound characteristics in each country, but which shows substantial item-by-country interactions, may be regarded as a different item (for scaling purposes) in each country, or in some subset of countries; that is, the difficulty parameter is free to vary across countries.
Both the second and the third options have the same impact on comparisons between countries. That is, if an item is identified as behaving differently in different countries, choosing either option will have the same impact on inter-country comparisons. The choice between them could, however, influence within-country comparisons.
When reviewing the national calibrations, particular attention was paid to the fit of the items to the scaling model, to item discrimination, and to item-by-country interactions.
Item Response Model Fit (Infit Mean Square)
For each item parameter, the ConQuest fit mean square statistic (Wu, 1997) was used to provide an indication of the compatibility of the model and the data. For each student, the model describes the probability of obtaining the different item scores, so it is possible to compare the model prediction with what has been observed for one item across students. Accumulating these comparisons across cases gives an item-fit statistic.
As the fit statistics compare an observed value with a predicted value, the fit is an analysis of residuals. In the case of the item infit mean square, values near one are desirable. An infit mean square greater than one is often associated with a low discrimination index, and an infit mean square less than one is often associated with a high discrimination index.
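For a dichotomous item under the simple Rasch model, an information-weighted (infit) mean square of the kind described here is the ratio of summed squared residuals to summed response variances. The following is an illustrative computation of such a statistic, not ConQuest's implementation:

```python
import math

def rasch_probability(theta, delta):
    """Simple Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def infit_mean_square(responses, thetas, delta):
    """Sum of squared residuals over sum of binomial variances; values
    near one indicate compatibility of model and data."""
    num = den = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, delta)
        num += (x - p) ** 2
        den += p * (1.0 - p)
    return num / den
```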
Discrimination Coefficients
For each item, the correlation between the students' scores on that item and their aggregate scores on the item set for the same domain and booklet as the item of interest was used as an index of discrimination. If $p_{ij}$ ($= x_{ij}/m_j$) is the proportion of the maximum score that student $i$ achieved on item $j$, and $p_i = \sum_j p_{ij}$ (where the summation is over the items from the same booklet and domain as item $j$) is the sum of the proportions of the maximum score achieved by student $i$, then the discrimination is calculated as the product-moment correlation between $p_{ij}$ and $p_i$ for all students. For multiple-choice and short-answer items, this index is the usual point-biserial index of discrimination.
The point-biserial index of discrimination for a particular category of an item compares the aggregate scores of students selecting that category with those of all other students. If the category is the correct answer, the point-biserial index of discrimination should be higher than 0.25. Non-key categories should have a negative point-biserial index of discrimination. The point-biserial indices of discrimination for a partial credit item should be ordered, i.e., the point-biserial correlation of categories scored 0 should be lower than that of categories scored 1, and so on.
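A minimal sketch of the index just described, assuming complete data for the students of a single booklet (the function name and data layout are illustrative, not PISA's analysis software):

```python
def discrimination_index(scores, max_scores, item):
    """Discrimination index: the product-moment correlation between
    p_ij (= x_ij / m_j, the proportion of the maximum score reached on
    item j) and p_i (the sum of those proportions over the items of the
    same booklet and domain).  `scores` is a list of per-student score
    lists; `max_scores` gives the maximum score of each item.  For
    dichotomous items this reduces to the usual point-biserial.
    """
    p = [[s / m for s, m in zip(row, max_scores)] for row in scores]
    xs = [row[item] for row in p]          # p_ij for the item of interest
    ys = [sum(row) for row in p]           # p_i, the aggregate
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)
```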
Item-by-Country Interaction
The national scaling provides nationally specific item parameter estimates. The consistency of item parameter estimates across countries was of particular interest. If the test measured the same latent trait per domain in all countries, items should have the same relative difficulty or, more precisely, should fall within the interval defined by the standard error on the item parameter estimate.
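The check described above can be sketched as a simple rule. This is an illustrative operationalisation only: the report says estimates should fall "within the interval defined by the standard error", and the z = 1.96 band used here is an assumption, not the Consortium's exact criterion:

```python
def shows_interaction(national_delta, international_delta,
                      se_national, se_international, z=1.96):
    """Hypothetical item-by-country interaction check: flag the item
    when the national difficulty estimate falls outside the confidence
    band implied by the standard errors of the two estimates.
    """
    combined_se = (se_national ** 2 + se_international ** 2) ** 0.5
    return abs(national_delta - international_delta) > z * combined_se
```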
National Reports
After national scaling, five reports were returned to each participating country to assist in reviewing their data with the Consortium:1
• Report 1 presented the results of a basic item analysis in tabular form. For each item, the number of students, the percentage of students and the point-biserial correlation were provided for each valid category.
• Report 2 provided, for each item and for each valid category, the point-biserial correlation and the student-centred Item Response Theory (IRT) ability average in graphical form.
• Report 3 provided a graphical comparison of the Item Infit Mean Square coefficients and the item discrimination coefficients computed at national and international levels.
• Report 4 provided a graphical comparison of both the item difficulty parameter and the item thresholds,2 computed at national and international levels.
• Report 5 listed the items that National Project Managers (NPMs) needed to check for mistranslation and/or misprinting, referred to as dodgy items.
Report 1: Descriptive Statistics on Individual Items in Tabular Form
A detailed item-by-item report was provided in tabular form showing the basic item analysis statistics at the national level (see Figure 12 for an example).
The table shows each possible response category for each item. The second column indicates the score assigned to the different categories. For each category, the number and percentage of students responding is shown, along with the point-biserial correlation and the associated t statistic. Note that for the item in the example the correct answer is '4', indicated by the '1' in the score column; thus the point-biserial for a response of '4' is the item's discrimination index, also shown along the top.
1 In addition, two reports showing results from the Student and School Questionnaires were also returned to participants.
2 A threshold for an item score is the point on the scale at which the probability of a response at that score or higher becomes greater than 50 per cent.
[Figure 13 shows, for item 1M033Q01 (key 4, discrimination 0.39), the number and percentage of students in each response category (0-9 and R), together with the Average Ability by Category and the Point-Biserial by Category.]
The report shows two kinds of missing data: row 9 indicates students who omitted the item but responded validly to at least one subsequent item; row r shows students who did not reach this item.
Report 2: Descriptive Statistics on Individual Items in Graphical Form
Report 2 (see Figure 13) graphs the ability average and the point-biserial correlation by category. Average Ability by Category is calculated by domain and centred for each item. This makes it easy to identify positive and negative ability categories, so that checks can be made to ensure that, for multiple-choice items, the key category has the highest average ability estimate, and, for constructed-response items, the mean abilities are ordered consistently with the score levels. The displayed graphs also facilitate the process of identifying the following anomalies:
• a non-key category with a positive point-biserial, or a point-biserial higher than the key category;
• a key category with a negative point-biserial; and
• for partial-credit items, average abilities (and point-biserials) not increasing with the score points.
Item 1
------
item:1 (M033Q01)
Cases for this item 1372    Discrimination is 0.39
--------------------------------------------------------------------
 Label   Score   Count   % of total   Pt Bis       t
--------------------------------------------------------------------
   0             0          0.00        NA         NA
   1     0.00    17         1.28       -0.08      -3.14
   2     0.00    127        9.57       -0.19      -7.15
   3     0.00    102        7.69       -0.12      -4.57
   4     1.00    1053      79.35        0.39      16.34
   5             0          0.00        NA         NA
   6             0          0.00        NA         NA
   7             0          0.00        NA         NA
   8     0.00    2          0.15        0.02       0.61
   9     0.00    57         4.30       -0.22      -8.47
   r     0.00    14         1.06       -0.24      -9.15
====================================================================

Figure 12: Example of Item Statistics Shown in Report 1
Figure 13: Example of Item Statistics Shown in Report 2
Report 3: Comparison of National and International Infit Mean Square and Discrimination Coefficients
The national scaling provided the infit mean square, the point-biserial correlation, the item parameter estimate (or difficulty estimate) and the thresholds for each item in each country. Reports 3 (see Figure 14) and 4 (see Figure 15) compare, for each item, the value computed for one country with those computed for all other countries and with the value computed at international level.
The black crosses present the values of the coefficients computed from the international database. Shaded boxes represent the mean plus or minus one standard deviation of the national values. Shaded crosses represent the values for the national data set of the country to which the report was returned.
Substantial differences between the national and international value on one or both of these indices show that the item is behaving differently in that country. This might reflect a mistranslation or another problem specific to the national version; if, however, the item misbehaved in all or nearly all countries, it would more likely reflect a problem in the source item rather than in the national versions.
Figure 14: Example of Item Statistics Shown in Report 3
Figure 15: Example of Item Statistics Shown in Report 4
[Figures 14 and 15 show, for item M033Q01, the national value (shaded cross), the international value (black cross) and the summary over all national values (mean +/- 1 SD, shaded box): in Report 3 for the delta (item difficulty), the infit mean square and the discrimination index, and in Report 4 for the item difficulty parameter and the item-category thresholds.]
Report 4: Comparison of National and International Item Difficulty Parameters and Thresholds
Report 4 presents the item difficulty parameters and the thresholds in the same graphical form as Report 3. Substantial differences between the national value and the international value (i.e., the mean of the national values) can be interpreted as an item-by-country interaction. Nevertheless, more appropriate estimates of the item-by-country interaction are provided in Report 5.
Report 5: National Dodgy Item Report

Report 5 lists, for each country, the items that were flagged for one or more of the following reasons: the difficulty is significantly easier or harder than average; a non-key category has a point-biserial correlation higher than 0.05 when at least 10 students selected it; the key category point-biserial correlation is lower than 0.25; the category abilities for partial credit items are not ordered; and/or the infit mean square is higher than 1.20 or lower than 0.8. An example extract is shown in Figure 16.
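The screening rules can be collected into a single check. The sketch below is illustrative only: the dictionary keys are hypothetical, and the "significantly easier or harder than average" rule is operationalised here as |z| > 1.96, which the report does not specify:

```python
def dodgy_flags(item):
    """Apply the Report 5 screening rules to one item.

    `item` is a dict with (hypothetical) keys: 'difficulty_z' (a
    standardised national-vs-average difficulty difference),
    'nonkey_ptbis' (list of (point-biserial, n-students) pairs for the
    non-key categories), 'key_ptbis', 'category_abilities' (mean
    abilities by ascending score) and 'infit'.  Returns the list of
    triggered flags.
    """
    flags = []
    if abs(item['difficulty_z']) > 1.96:          # assumed criterion
        flags.append('easier than average' if item['difficulty_z'] < 0
                     else 'harder than average')
    if any(pb > 0.05 and n >= 10 for pb, n in item['nonkey_ptbis']):
        flags.append('non-key point-biserial > 0.05')
    if item['key_ptbis'] < 0.25:
        flags.append('key point-biserial < 0.25')
    abilities = item['category_abilities']
    if any(a >= b for a, b in zip(abilities, abilities[1:])):
        flags.append('category abilities not ordered')
    if not 0.8 <= item['infit'] <= 1.20:
        flags.append('infit outside [0.8, 1.20]')
    return flags
```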
International Calibration
International item parameters were set by applying the conditional item response model (9) in conjunction with the multivariate population model (15), without using conditioning variables, to a sub-sample of students. This sub-sample of students, referred to as the international calibration sample, consisted of
[Figure 16 lists each dodgy item (e.g., M033Q01, M037Q02T, M124Q01) against the flags raised, grouped as: item-by-country interactions (harder than expected, easier than expected), discrimination (non-key point-biserial correlation is positive, key point-biserial correlation is negative, ability not ordered) and fit (large/low-discrimination item, small/high-discrimination item).]
13 500 students, comprising 500 students drawn at random from each of the 27 participating OECD countries that met the PISA 2000 response rate standards (see Chapter 13). This excluded any SE booklet students, and the sample was stratified by linguistic community within each country.
The allocation of each PISA item to one of the five PISA 2000 scales is given in Appendix 1 (for reading), Appendix 2 (for mathematics) and Appendix 3 (for science).
Student Score Generation
As with all item response scaling models, student proficiencies (or measures) are not observed; they are missing data that must be inferred from the observed item responses. There are several alternative approaches for making this inference. PISA used two: maximum likelihood, using Warm's (1985) Weighted Likelihood Estimator (WLE), and plausible values (PVs). The WLE is the proficiency that makes the student's observed score most likely. PVs are a selection of likely proficiencies for students who attained each score.
Computing Maximum Likelihood Estimates in PISA
Six weighted likelihood estimates were provided for each student: one each for mathematical literacy, reading literacy and scientific literacy, and one for each of the three reading literacy sub-scales. These can be treated as (essentially) unbiased estimates of student abilities, and analysed using standard methods.

Figure 16: Example of Item Statistics Shown in Report 5
Weighted maximum likelihood ability estimates (Warm, 1985) are produced by maximising (9) with respect to $\boldsymbol{\theta}_n$, that is, solving the likelihood equations:

$$\sum_{d\in D}\left[\sum_{i\in\Omega}\left(\mathbf{b}_i x_{ni} - \sum_{j=1}^{K_i}\mathbf{b}_{ij}\,\frac{\exp\!\left(b_{ij}s_{nd}\theta_{nd} + \mathbf{a}'_{ij}\boldsymbol{\xi}\right)}{\sum_{k=1}^{K_i}\exp\!\left(b_{ik}s_{nd}\theta_{nd} + \mathbf{a}'_{ik}\boldsymbol{\xi}\right)}\right) + \frac{J_{nd}}{2I_{nd}}\right] = 0 \qquad (18)$$

for each case, where $\boldsymbol{\xi}$ are the item parameter estimates obtained from the international calibration and $d$ indexes the latent dimensions. $I_{nd}$ is the test information for student $n$ on dimension $d$, and $J_{nd}$ is its first derivative with respect to $\theta$. These equations are solved using a routine based on the Newton-Raphson method.
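In the unidimensional dichotomous Rasch special case, the likelihood equation above reduces to $r - \sum_i P_i + J/(2I) = 0$, with $P_i$ the Rasch success probability, $I = \sum_i P_i(1-P_i)$ the test information and $J$ its first derivative. A sketch of the solver follows; it uses bisection rather than Newton-Raphson for robustness and is illustrative, not the PISA production code:

```python
import math

def wle(deltas, raw_score):
    """Warm's weighted likelihood estimate for a dichotomous Rasch
    test.  g(theta) = r - sum(P_i) + J/(2I) is decreasing in theta, so
    its root is found by bisection.  Unlike the plain ML estimate, the
    J/(2I) correction keeps zero and perfect scores finite.
    """
    def g(theta):
        p = [1.0 / (1.0 + math.exp(d - theta)) for d in deltas]
        info = sum(pi * (1.0 - pi) for pi in p)          # I
        j = sum(pi * (1.0 - pi) * (1.0 - 2.0 * pi) for pi in p)  # J
        return raw_score - sum(p) + j / (2.0 * info)

    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a single item of difficulty 0, the equation becomes $1.5 - 2P = 0$ for a correct response, so the estimate is $\ln 3$.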
Plausible Values
Using item parameters anchored at their estimated values from the international calibration, the plausible values are random draws from the marginal posterior of the latent distribution, (15), for each student. For details on the uses of plausible values, see Mislevy (1991) and Mislevy et al. (1992).
In PISA, the random draws from the marginal posterior distribution (17) are taken as follows. M vector-valued random deviates, $\boldsymbol{\varphi}_{mn}$, $m = 1, \ldots, M$, are drawn from the multivariate normal distribution, $f_\theta(\boldsymbol{\theta}_n; \mathbf{W}_n\boldsymbol{\gamma}, \boldsymbol{\Sigma})$, for each case $n$.3 These vectors are used to approximate the integral in the denominator of (17), using the Monte-Carlo integration:

$$\int f_x(\mathbf{x}; \boldsymbol{\xi}\,|\,\boldsymbol{\theta})\, f_\theta(\boldsymbol{\theta}; \mathbf{W}_n\boldsymbol{\gamma}, \boldsymbol{\Sigma})\, d\boldsymbol{\theta} \;\approx\; \frac{1}{M}\sum_{m=1}^{M} f_x(\mathbf{x}; \boldsymbol{\xi}\,|\,\boldsymbol{\varphi}_{mn}) \;\equiv\; \Im \qquad (19)$$

At the same time, the values:

$$p_{mn} = f_x(\mathbf{x}_n; \boldsymbol{\xi}\,|\,\boldsymbol{\varphi}_{mn})\, f_\theta(\boldsymbol{\varphi}_{mn}; \mathbf{W}_n\boldsymbol{\gamma}, \boldsymbol{\Sigma}) \qquad (20)$$

are calculated, yielding the set of pairs $\left(\boldsymbol{\varphi}_{mn}, p_{mn}/\Im\right)$, $m = 1, \ldots, M$, which can be used as an approximation of the posterior density (17); and the probability that $\boldsymbol{\varphi}_{jn}$ could be drawn from this density is given by:

$$q_{jn} = \frac{p_{jn}}{\sum_{m=1}^{M} p_{mn}} \qquad (21)$$

At this point, L uniformly distributed random numbers $\eta_i$ are generated; and for each random draw, the vector, $\boldsymbol{\varphi}_{i_0 n}$, that satisfies the condition:

$$\sum_{s=1}^{i_0 - 1} q_{sn} \;<\; \eta_i \;\le\; \sum_{s=1}^{i_0} q_{sn} \qquad (22)$$

is selected as a plausible vector.
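The draw can be sketched in the univariate case as follows. This is illustrative only: `likelihood` stands in for $f_x(\mathbf{x}; \boldsymbol{\xi}\,|\,\theta)$ evaluated at the student's observed responses, and the normal parameters for the conditional population model:

```python
import math
import random

def draw_plausible_values(likelihood, mu, sigma, n_pv=5, m=2000, seed=7):
    """Univariate sketch of the plausible-value draws above: M deviates
    phi from the normal population model, the weights p_m of (20), the
    normalised probabilities q of (21), and n_pv inverse-CDF selections
    as in (22).
    """
    rng = random.Random(seed)
    phi = [rng.gauss(mu, sigma) for _ in range(m)]
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    p = [likelihood(t) * norm * math.exp(-0.5 * ((t - mu) / sigma) ** 2)
         for t in phi]                     # equation (20)
    total = sum(p)
    q = [pm / total for pm in p]           # equation (21)
    pvs = []
    for _ in range(n_pv):
        eta, cum = rng.random(), 0.0
        for t, qs in zip(phi, q):
            cum += qs
            if eta <= cum:                 # condition (22)
                pvs.append(t)
                break
        else:
            pvs.append(phi[-1])            # guard against rounding at eta ~ 1
    return pvs
```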
3 The value M should be large. In PISA, M = 2000 was used.
Constructing Conditioning Variables
The PISA conditioning variables are prepared using procedures based on those used in the United States National Assessment of Educational Progress (Beaton, 1987) and in TIMSS (Macaskill, Adams and Wu, 1998). The steps involved in this process are as follows:
• Step 1. Each variable in the Student Questionnaire was dummy coded according to the coding presented in Appendix 8.4
• Step 2. For each country, a principal components analysis of the dummy-coded variables was performed, and component scores were produced for each student (a sufficient number of components to account for 90 per cent of the variance in the original variables).
• Step 3. Using item parameters anchored at their international location and conditioning variables derived from the national principal components analysis, the item-response model was fit to each national data set and the national population parameters γ and Σ were estimated.5
• Step 4. Five vectors of plausible values were drawn using the method described above.
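Step 2 can be sketched in pure Python with power iteration and deflation (an illustrative routine, not the operational analysis software, which used dedicated tools):

```python
import random

def pca_scores(rows, variance_target=0.90, iters=200):
    """Principal component scores of dummy-coded variables, keeping
    enough components to account for `variance_target` of the total
    variance.  Power iteration with deflation on the covariance
    matrix; fine for illustration, not for production-sized data.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    total_var = sum(cov[j][j] for j in range(p))
    rng = random.Random(0)
    components, explained = [], 0.0
    while explained / total_var < variance_target:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for _ in range(iters):             # power iteration
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            nrm = sum(wi * wi for wi in w) ** 0.5
            v = [wi / nrm for wi in w]
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                  for a in range(p))       # eigenvalue = explained variance
        components.append(v)
        explained += lam
        for a in range(p):                 # deflate the found component
            for b in range(p):
                cov[a][b] -= lam * v[a] * v[b]
    return [[sum(X[i][j] * v[j] for j in range(p)) for v in components]
            for i in range(n)]
```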
Analysis of Data with Plausible Values
It is very important to recognise that plausible values are not test scores and should not be treated as such. They are random numbers drawn from the distribution of scores that could be reasonably assigned to each individual, that is, the marginal posterior distribution (17). As such, plausible values contain random error variance components and are not optimal as scores for individuals.6 Plausible values as a set are better suited to describing the performance of the population. This approach, developed by Mislevy and Sheehan (1987, 1989) and based on the imputation theory of Rubin (1987), produces
consistent estimators of population parameters. Plausible values are intermediate values provided to obtain consistent estimates of population parameters using standard statistical analysis software such as SPSS® and SAS®. As an alternative, analyses can be completed using ConQuest (Wu, Adams and Wilson, 1997).
The PISA student file contains 30 plausible values: five for each of the five PISA 2000 cognitive scales and five for the combined reading scale. PV1MATH to PV5MATH are the five for mathematical literacy, PV1SCIE to PV5SCIE for scientific literacy, and PV1READ to PV5READ for combined reading literacy. For the three reading literacy sub-scales (retrieving information, interpreting texts, and reflection and evaluation), the plausible values variables are PV1READ1 to PV5READ1, PV1READ2 to PV5READ2 and PV1READ3 to PV5READ3, respectively.
If an analysis were to be undertaken with one of these five cognitive scales or with the combined reading scale, it would ideally be undertaken five times, once with each relevant plausible values variable. The results would be averaged, and significance tests adjusting for variation between the five sets of results would then be computed.
More formally, suppose that $r(\boldsymbol{\theta}, \mathbf{Y})$ is a statistic that depends upon the latent variable and some other observed characteristic of each student. That is:

$$(\boldsymbol{\theta}, \mathbf{Y}) = \left((\theta_1, y_1), (\theta_2, y_2), \ldots, (\theta_N, y_N)\right),$$

where $(\theta_n, y_n)$ are the values of the latent variable and the other observed characteristic for student $n$. Unfortunately $\theta_n$ is not observed. However, the item responses, $\mathbf{x}_n$, from which the marginal posterior $h_\theta(\theta_n; \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}\,|\,\mathbf{x}_n, y_n)$ can be constructed for each student $n$, are observed. If $h_\theta(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}\,|\,\mathbf{Y}, \mathbf{X})$ is the joint marginal posterior for $n = 1, \ldots, N$, then:

$$r^*(\mathbf{X}, \mathbf{Y}) = E\!\left[r(\boldsymbol{\theta}, \mathbf{Y})\,|\,\mathbf{X}, \mathbf{Y}\right] = \int r(\boldsymbol{\theta}, \mathbf{Y})\, h_\theta(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}\,|\,\mathbf{Y}, \mathbf{X})\, d\boldsymbol{\theta} \qquad (23)$$

can be computed. The integral in (23) can be computed using the Monte-Carlo method. If M random vectors $(\boldsymbol{\Theta}_1, \boldsymbol{\Theta}_2, \ldots, \boldsymbol{\Theta}_M)$ are drawn from $h_\theta(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}\,|\,\mathbf{Y}, \mathbf{X})$, (23) is approximated by:

$$r^*(\mathbf{X}, \mathbf{Y}) \approx \frac{1}{M}\sum_{m=1}^{M} r(\boldsymbol{\Theta}_m, \mathbf{Y}) = \frac{1}{M}\sum_{m=1}^{M} \hat{r}_m, \qquad (24)$$
4 With the exception of gender and ISEI.
5 In addition to the principal components, gender, ISEI and school mean performance were added as conditioning variables.
6 Where optimal might be defined, for example, as either unbiased or minimising the mean squared error at the student level.
where $\hat{r}_m$ is the estimate of $r$ computed using the $m$-th set of plausible values.
From (24) we can see that the final estimate of $r$ is the average of the estimates computed using each plausible value in turn. If $U_m$ is the sampling variance for $\hat{r}_m$, then the sampling variance of $r^*$ is:

$$V = U^* + \left(1 + M^{-1}\right) B_M, \qquad (25)$$

where $U^* = \frac{1}{M}\sum_{m=1}^{M} U_m$ and $B_M = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{r}_m - r^*\right)^2$.

An $\alpha$-% confidence interval for $r^*$ is:

$$r^* \pm t_{\upsilon}\!\left(1 - \tfrac{\alpha}{2}\right) V^{\frac{1}{2}},$$

where $t_{\upsilon}(s)$ is the $s$ percentile of the $t$-distribution with $\upsilon$ degrees of freedom,

$$\upsilon = \left[\frac{f_M^2}{M-1} + \frac{\left(1 - f_M\right)^2}{d}\right]^{-1}, \qquad f_M = \left(1 + M^{-1}\right) B_M / V,$$

and $d$ is the degrees of freedom that would have applied had $\theta_n$ been observed. In PISA, $d$ will vary by country and have a maximum possible value of 80.
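These combination rules can be sketched directly (the function and argument names are illustrative):

```python
def combine_pv_estimates(estimates, sampling_vars, d=80):
    """Combine per-plausible-value results with the rules above.

    estimates: the r-hat_m from analysing each set of plausible values;
    sampling_vars: U_m, their sampling variances.  Returns the pooled
    estimate r*, the total variance V of (25), and the degrees of
    freedom upsilon; the t-quantile for the confidence interval is
    left to the caller.
    """
    m = len(estimates)
    r_star = sum(estimates) / m
    u_star = sum(sampling_vars) / m                            # average sampling variance
    b_m = sum((r - r_star) ** 2 for r in estimates) / (m - 1)  # between-PV variance
    v = u_star + (1.0 + 1.0 / m) * b_m                         # equation (25)
    f_m = (1.0 + 1.0 / m) * b_m / v                            # fraction of missing information
    nu = 1.0 / (f_m ** 2 / (m - 1) + (1.0 - f_m) ** 2 / d)
    return r_star, v, nu
```

When the five estimates agree exactly, the between-PV variance vanishes and the degrees of freedom reduce to $d$.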
Coding and Marker Reliability Studies
As explained in the first section of this report, on Test Design (see Chapter 2), a substantial proportion of the PISA 2000 items were open-ended and required marking by trained markers (or coders). It was important therefore that PISA implemented procedures that maximised the validity and consistency (both within and between countries) of this marking.
Each country coded items on the basis of Marking Guides prepared by the Consortium (see Chapter 2), using the marking design described in Chapter 6. Sessions to train countries in the use of the Marking Guides were held prior to both the field trial and the main study.
This chapter describes three aspects of the coding and marking reliability studies undertaken in conjunction with the field trial and the main study. These are: the homogeneity analyses undertaken with the field trial data to assist the test developers in constructing valid, reliable scoring rubrics; the variance component analyses undertaken with the main study data to examine within-country rater reliability; and an Inter-country Reliability Study undertaken to examine the between-country consistency in applying the Marking Guides.
Examining Within-country Variability in Marking
Norman Verhelst
To obtain an estimate of the between-marker variability within each country, multiple marking was required for at least some student answers.
Therefore, it was decided that multiple markings would be collected for all open-ended items in both the field trial and the main study for a moderate number of students. In the main study, either 48 or 72 students' booklets were multiply marked, depending on the country. The requirement was that the same four expert markers per domain (reading, mathematics and science) should mark all items appearing together in a test booklet. A booklet containing, for example, 15 reading items would give a three-dimensional table for reading (48 or 72 students by 15 items by 4 markers), where each cell contains a single category. For each domain and each booklet, such a table was produced and processed in several analyses, which are described later. These data sets were required from each participating country.
The field trial problems were quite different from those in the main study. In the field trial, many more items were tried than were used in the main study. One important purpose of the field trial was to select a subset of items to be used in the main study. One obvious concern was to ensure that markers agreed to a reasonable degree in their categorisation of the answers. More subtle problems can arise, however. In the final administration of a test, a student answer is scored numerically. But in the construction phase of an item, more than two response categories may be provided, say A, B and C, and it may not always be clear how these should be converted to numerical scores. The technique used to analyse the field trial data can provide at least a partial answer, and also give an indication of the agreement between markers for each item separately. The technique is called homogeneity analysis. It is important to note that the data set for this analysis is treated as a collection of nominal or qualitative variables.

In the main study, the problem was different. The field trial concluded with a selection of a definite subset of items and a scoring rule for each. The main problem in the main study was to determine how much of the total variance of the numerical test scores could be attributed to variability across markers. The basic data set to be analysed therefore consisted of a three-dimensional table of numerical scores. The technique used is referred to as variance component analysis.

Some items in the field trial appeared to function poorly because of well-identified defects (e.g., poor translation). To get an impression of the differences in marker variability between field trial and main study, most field trial data analyses were repeated using the main study data. Some comparisons are reported below.

This chapter uses a consistent notational system, summarised in Figure 17.

Nested data structures are occasionally referred to, as every student and every marker belongs to a single country. In such cases, the indices m and v take the subscript c.

Homogeneity Analysis

In the analysis, the basic observation is the category into which marker $m$ places the response of student $v$ on item $i$, denoted $O_{ivm}$. Basic in the approach of homogeneity analysis is to consider observations as qualitative or nominal variables. (For a more mathematical treatment of homogeneity analysis, see Nishisato, 1980; Gifi, 1990; or Greenacre, 1984.) Although observations may be coded as digits, these digits are considered as labels, not numbers. To have a consistent notational system, it is assumed in the sequel that the response categories of item $i$ are labelled $\ell = 1, \ldots, L_i$. The main purpose of the analysis is to convert these qualitative observations into (quantitative) data which are in some sense optimal.

The basic loss function

The first step in the analysis is to define a set of (binary) indicator variables $g_{ivm\ell}$ that contain all the information of the original observations, defined by:

$$O_{ivm} = \ell \;\Leftrightarrow\; g_{ivm\ell} = 1, \qquad (26)$$

where it is to be understood that $g_{ivm\ell}$ can take only the values 0 and 1.

The basic principle of homogeneity analysis is to assign a number $x_{iv}$ to each student (the student score on item $i$), and a number $y_{im\ell}$ to each observation $O_{ivm} = \ell$, called the category quantification, such that student scores are in some way the best summary of all category quantifications that apply to them. To understand this in more detail, consider the following loss function:

$$F_i = \sum_{v}\sum_{m}\sum_{\ell} g_{ivm\ell}\left(x_{iv} - y_{im\ell}\right)^2. \qquad (27)$$
Symbol   Range            Meaning
i        1,...,I          item
c        1,...,C          country
V_c                       number of students from country c
v        1,...,V_c        student
M_c                       number of markers from country c
m        1,...,M = ΣM_c   marker
ℓ        1,...,L_i        category of item i

Figure 17: Notational System
The data are represented by indicator variables. If marker $m$ has a good idea of the potential of student $v$, and thinks it appropriate to assign him/her to category $\ell$, then, ideally, one would expect that $x_{iv} = y_{im\ell}$, yielding a zero loss for that case. But the same marker can have used the same category for student $v'$, who has a different potential ($x_{iv'}$), and since the quantification $y_{im\ell}$ is unique, there cannot be a zero loss in both cases. Thus, some kind of compromise is required, which is made by minimising the loss function $F_i$.

Four observations have to be made in connection with this minimisation:
• The loss function (27) certainly has no unique minimum, because adding an arbitrary constant to all $x_{iv}$ and all $y_{im\ell}$ leaves $F_i$ unaltered. This means that in some way an origin of the scale must be chosen. Although this origin is arbitrary, there are some theoretical advantages in defining it through the equality:

$$\sum_{v} x_{iv} = 0. \qquad (28)$$

• If one chooses $x_{iv} = y_{im\ell} = 0$ for all $v$, $m$ and $\ell$, then $F_i = 0$ (and (28) is fulfilled), which is certainly a minimum. Such a solution, where all variability in the student scores and in the category quantifications is suppressed, is called a degenerate solution. To avoid such degeneracy, and at the same time to choose a unit of the scale, requires the restriction:

$$\sum_{v} x_{iv}^2 = V. \qquad (29)$$

Restrictions (28) and (29) jointly guarantee that a unique minimum of $F_i$ exists and corresponds to a non-degenerate solution, except in some special cases discussed below.
• Notice that in the loss function, missing observations are taken into account in an appropriate way. From definition (26) it follows that if $O_{ivm}$ is missing, $g_{ivm\ell} = 0$ for all $\ell$, such that a missing observation never contributes to a positive loss.
• A distinct loss function is minimised for each item. Although other approaches to homogeneity analysis are possible, the present one serves the purpose of item analysis well. The data pertaining to a single item are analysed separately, requiring no assumptions on their relationships. A later subsection shows how to combine these separate analyses to compare markers and countries.

Another way to look at homogeneity analysis is to arrange the basic observations (for a single item $i$) in a table with rows corresponding to students and columns corresponding to markers, as in the left panel of Figure 18. The results can be considered as a transformation of the observations into numbers, as shown in the right panel of the figure. At the minimum of the loss function, the quantified observations have the following interesting properties. The total variance can be partitioned into three parts: one part attributable to the columns (the markers), another part attributable to the rows (the students), and a residual variance. At the solution point it holds that:

$$\mathrm{Var}(\text{markers}) = 0, \quad \mathrm{Var}(\text{students})\ \text{is maximised}, \quad \mathrm{Var}(\text{residuals})\ \text{is minimised}. \qquad (30)$$

If the markers agree well among themselves, Var(residuals) will be a small proportion of the total variance, meaning that the markers are very homogeneous. The index of homogeneity is therefore defined as:

$$H_{ic} = \frac{\mathrm{Var}(\text{students})}{\mathrm{Var}(\text{students}) + \mathrm{Var}(\text{residuals})}, \qquad (31)$$

at the point where $F_i$ attains its minimum. The subscript $c$ has been added to indicate that this index of homogeneity can only be meaningfully computed within a single country, as explained in the next sub-section.
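The minimisation of (27) under (28) and (29) can be sketched with alternating conditional means (an illustrative alternating-least-squares routine, not the software used for PISA):

```python
import random

def homogeneity(obs, iters=100, seed=3):
    """Homogeneity analysis for a single item.  obs[v][m] is the
    category marker m assigned to student v (None if missing).
    Student scores x_v and category quantifications y[m][cat] are
    alternately set to conditional means, which lowers the loss (27)
    at each step; the normalisations (28) and (29) fix origin and
    unit, and the index of homogeneity (31) is returned together with
    the student scores.
    """
    rng = random.Random(seed)
    V, M = len(obs), len(obs[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(V)]

    def quantify(x):
        # y[m][cat]: mean score of the students marker m put in cat
        y = []
        for mk in range(M):
            sums, counts = {}, {}
            for v in range(V):
                cat = obs[v][mk]
                if cat is not None:
                    sums[cat] = sums.get(cat, 0.0) + x[v]
                    counts[cat] = counts.get(cat, 0) + 1
            y.append({c: sums[c] / counts[c] for c in sums})
        return y

    for _ in range(iters):
        y = quantify(x)
        for v in range(V):
            vals = [y[mk][obs[v][mk]] for mk in range(M)
                    if obs[v][mk] is not None]
            x[v] = sum(vals) / len(vals)
        mean = sum(x) / V                               # restriction (28)
        x = [xv - mean for xv in x]
        scale = (V / sum(xv * xv for xv in x)) ** 0.5   # restriction (29)
        x = [xv * scale for xv in x]

    y = quantify(x)
    resid, n_obs = 0.0, 0
    for v in range(V):
        for mk in range(M):
            if obs[v][mk] is not None:
                resid += (x[v] - y[mk][obs[v][mk]]) ** 2
                n_obs += 1
    var_students = sum(xv * xv for xv in x) / V         # = 1 by (29)
    return var_students / (var_students + resid / n_obs), x
```

With two markers who agree perfectly, the residual variance is zero and the index is one; any disagreement pushes it below one.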
[Figure 18: Quantification of Categories. The left panel arranges the observations $O_{ivm} = \ell$ in a students-by-markers (V × M) table; the right panel shows the same table with each observation replaced by its quantification $y_{im\ell}$.]
The indices $H_{ic}$ can be compared meaningfully with each other, and across countries and items, because they are all proportions of variance attributable to the same source (students), compared to the total variance attributable to students and markers. The differences between the $H_{ic}$-indices, therefore, must be attributed to the items, and can be used as an instrument for item analysis. Items with a high $H_{ic}$-index are less susceptible to marker variation, and therefore the scores obtained on them are more easily generalisable across markers.
Degenerate and quasi-degenerate solutions

Although restriction (29) was introduced to avoid degenerate solutions, it is not always sufficient, and the data collection design or some peculiarities in the collected data can lead to other kinds of degeneration. First, degeneracy due to the design is discussed.
Using the notational conventions explained in the introduction, the loss function (27) can, in the case of the data of all countries analysed jointly, be written as:

$$F_i = \sum_{c}\sum_{v_c}\sum_{m_c}\sum_{\ell} g_{iv_cm_c\ell}\left(x_{iv_c} - y_{im_c\ell}\right)^2. \qquad (32)$$

To see the degeneracy clearly, suppose $C = 2$, $V_1 = V_2$ and there are no missing responses. The value $F_i = 0$ (and thus $H_i = 1$) can be reached as follows: $x_{iv_c} = 1$ if $c = 1$, $x_{iv_c} = -1$ if $c = 2$, and $y_{im_c\ell} = x_{iv_c}$ (all $\ell$). This solution complies with (28) and (29), but it can easily be seen that it does nothing else than maximise the variance of the x's between countries and minimise the variance within countries. Of course, one could impose a restriction analogous to (29) for each country, but then (32) becomes C independent sums, and the scores (x-values) are no longer comparable across countries. A meaningful comparison across countries requires imposing restrictions of another kind, described in the next subsection.
In some cases, this kind of degeneracy may also occur in data sets collected in a complete design, but with extreme patterns of missing observations. Suppose some student has got a code from only one marker and assume, moreover, that this marker used this code only once. A degenerate solution may then occur where this student is contrasted with all the others (collapsed into a single point), much in the same way as in the example above.
Similar cases may occur when a certain code, say $\ell$, is used very few times. Assume this code is used only once, by marker $m$. By choosing a very extreme value for $y_{im\ell}$, a situation may occur where the student given code $\ell$ by marker $m$ tends to contrast with all the others, although fully collapsing them may be avoided (because this student's score is 'pulled' towards the others by the codes received from the other markers). But the general result will be one where the solution is dominated by this very infrequent coding by one marker. Such cases may be called quasi-degenerate and are examples of chance capitalisation. They are prone to occur in small samples of students, especially in cases where there are many different categories, as in the field trial, and especially with items with multiple-answer categories. Cases of quasi-degeneracy give a spuriously high $H_i$-index, and one should be very careful not to cherish such an item too much, because it might show very poor performance in a cross-validation.
Quasi-degeneracy is an intuitive notion that is not rigorously defined, and will continue to be a major source of concern, although adequate restrictions on the model parameters usually address the problem.
To develop good guidelines for selecting a good test from the many items used in the field trial, one should realise that a low homogeneity index points to items that will introduce considerable variability into the test score because of rater variance, and that may therefore best be excluded from the definitive test. But an item with a high index is not necessarily a good item. Quasi-degeneracy will tend to occur in cases where one or more response categories are used very infrequently. It might therefore be useful to develop a device that can simultaneously judge homogeneity and the risk of quasi-degeneracy.
Homogeneity Analysis with Restrictions
Apart from cases of quasi-degeneracy, there is another reason for imposing restrictions on the model parameters in homogeneity analysis. The specific value of $H_i$ obtained from the
minimisation of (27) can only be attained if the quantification issued from the homogeneity analysis is indeed used in applications, i.e., when the score obtained by student $v$ on item $i$ when categorised by marker $m$ in category $\ell$ is equal to the category quantification $y_{im\ell}$. But this means that the number of points to be earned from receiving category $\ell$ may differ across markers. An extra difficulty arises when new markers are used in future applications: in every application, a new homogeneity analysis would have to be completed. This is not very attractive for a project like PISA, where the field trial is meant to determine a (more or less) definite scoring rule, whereby the same scoring applies across all raters and countries.
Restrictions within countries

As a first restriction, one might wish for no variation across markers within the same country, so that for each country c the restriction:

$$y_{im\ell} = y^*_{ic\ell} \qquad (m = 1, \ldots, M_c;\; \ell = 1, \ldots, L_i) \tag{33}$$

is imposed. But since students and markers are nested within countries, this amounts to minimising the loss function for each country:

$$F^*_{ic} = \sum_{v=1}^{V_c} \sum_{m=1}^{M_c} \sum_{\ell=1}^{L_i} g_{ivm\ell}\,(x_{iv} - y^*_{ic\ell})^2 , \tag{34}$$

such that the overall loss function:

$$F^*_i = \sum_c F^*_{ic} \tag{35}$$

is minimised automatically.
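The relation between the per-country losses (34) and the overall loss (35) can be illustrated numerically. The sketch below is illustrative only: the indicator array `g`, the student scores `x` and the quantifications `y` are invented toy data, and `best_y` is simply the weighted least-squares minimiser of (34) for fixed student scores (function names are mine, not the report's).

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(cats, L):
    # g_ivml indicator: 1 where marker m puts student v's answer in category l
    g = np.zeros(cats.shape + (L,))
    np.put_along_axis(g, cats[..., None], 1.0, axis=-1)
    return g

def loss(g, x, y):
    # equation (34): sum over v, m, l of g_ivml * (x_iv - y_icl)^2, one item
    return float((g * (x[:, None, None] - y[None, None, :]) ** 2).sum())

def best_y(g, x):
    # for fixed student scores, the weighted category mean minimises the loss
    w = g.sum(axis=(0, 1))
    return (g * x[:, None, None]).sum(axis=(0, 1)) / np.where(w > 0, w, 1.0)

L = 2
data = []                                        # one (g, x) pair per country
for Vc in (4, 5):
    cats = rng.integers(0, L, size=(Vc, 3))      # Vc students, 3 markers
    data.append((one_hot(cats, L), rng.normal(size=Vc)))

F_ic = [loss(g, x, best_y(g, x)) for g, x in data]
F_i = sum(F_ic)                                  # equation (35)

# forcing one common quantification on all countries, as restriction (37)
# later does, can only increase the total loss
y_common = np.zeros(L)
F_common = sum(loss(g, x, y_common) for g, x in data)
```

Because minimising each per-country loss minimises their sum, any common quantification gives `F_common >= F_i`.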
To minimise (34), technical restrictions similar to (28) and (29) must hold per country.
As in the case without restrictions, homogeneity indices can be computed in this case also (see equation (31)), and will be denoted $H^*_{ic}$. It is easy to understand that for all items and all countries the inequality:

$$H^*_{ic} \leq H_{ic} \tag{36}$$

must hold.

In contrast to some indices to be discussed further, $H^*_{ic}$-indices are not systematically influenced by the numbers of students or raters. If students and raters within a country can be considered to be a random sample of some populations of students and markers, the computed indices are consistent (and unbiased) estimates of their population counterparts. The numbers of students and markers influence only the standard error.

The $H^*_{ic}$-indices can be used for a double purpose:
• Comparing $H^*_{ic}$ with $H_{ic}$ within countries: if, for some item, $H^*_{ic}$ is much lower than $H_{ic}$, this may point to systematic differences in interpretation of the Marking Guides among markers. Suppose, as an example, a binary item with categories A and B, and all markers in a country except one agree perfectly among themselves in their assignment of students to these categories; one marker, however, disagrees perfectly with the others, assigning to category A where the others choose B, and vice versa. Since each marker partitions the students into the same two subsets, the index of homogeneity $H_{ic}$ for this item will be one, but the category quantifications of the outlying marker will be different from those of the other markers. Requiring that the category quantifications be the same for all markers will force some compromise, and the resulting $H^*_{ic}$ will be lower than one.
• Differences between $H^*_{ic}$ and $H_{ic}$ can also be used to compare different countries among each other (per item). Such differences, especially if they are persistent across items, make it possible to detect countries where the marking process reveals systematic disagreement among the markers.
Restrictions within and across countries

Of course, different scoring rules for different countries are not acceptable for the main study. More restrictions are therefore needed to ascertain that the same category corresponds to the same item score (quantification) in each country. This amounts to a further restriction on (33), namely:

$$y^*_{ic\ell} = y^{**}_{i\ell} \qquad (c = 1, \ldots, C;\; \ell = 1, \ldots, L_i), \tag{37}$$

leading to the loss function:

$$F^{**}_i = \sum_{c=1}^{C} \sum_{v=1}^{V_c} \sum_{m=1}^{M_c} \sum_{\ell=1}^{L_i} g_{ivm\ell}\,(x_{iv} - y^{**}_{i\ell})^2 , \tag{38}$$

and a corresponding index of homogeneity, denoted as $H^{**}_i$.

Coding and marker reliability studies
To provide an impression of the use of these indices, a summary plot for the open-ended items in science is displayed in Figure 19. The line shown in the chart connects the $H^{**}_i$-indices for all items (sorted in decreasing order). An interval is plotted, for each item, based on the $H^*_{ic}$-indices of 24 countries. For these countries, the $H^*_{ic}$-indices were sorted in ascending order and the endpoints of the intervals correspond to the 6th and 18th values. This figure may be a useful guide for selecting items for the main study, since it allows a selection based on the value of $H^{**}_i$, but also shows clear differences in the variability of the $H^*_{ic}$-indices. The first three items, for example, have almost equal $H^{**}_i$-indices, but the second one shows more variability across countries than the other two, and therefore may be less suitable for the main study. But perhaps the clearest example is the last one, with the lowest $H^{**}_i$-index and showing large variations across countries.

Figure 19: H* and H** for Science Items (vertical axis: Index of Homogeneity)

An Additional Criterion for Selecting Items

An ideal situation (from the viewpoint of statistical stability of the estimates) occurs when each of the $L_i$ response categories of an item has been used an equal number of times: for each marker separately when one does the analysis without restrictions; across markers within a country when an analysis is done with restrictions within a country (restriction (33)); or across markers and countries when restriction (38) applies. If the distribution of the categories departs strongly from uniformity, cases of quasi-degeneracy may occur: the contrast between a single category with very small frequency and all the other categories may tend to dominate the solution. Very small frequencies are more likely to occur in small samples than in large samples, hence the greater possibility of being influenced by chance. The most extreme case occurs when the whole distribution is concentrated in a single
category; a homogeneity analysis thus has no meaning (and technically cannot be carried out because the normalisation is not defined).

From a practical point of view, an item with a distribution very close to the extreme case of no variability is of little use in testing, but may have a very acceptable index of homogeneity. Therefore it seems wise to judge the quality of an item by considering simultaneously the homogeneity index and the form of the distribution.

To measure the departure from a uniform distribution in the case of nominal variables, the index to be developed must be invariant under permutation of the categories. For a binary item, for example, the index for an item with p-value p must be equal to that of an item with p-value 1 - p.
Pearson's well-known $X^2$ statistic, computed with all expected frequencies equal to each other, fulfils this requirement, and can be written in formula form as:

$$X^2 = \sum_\ell \frac{(f_\ell - \bar{f})^2}{\bar{f}} = n \sum_\ell \frac{(p_\ell - \bar{p})^2}{\bar{p}} , \tag{39}$$

where, in the middle expression, $f_\ell$ is the frequency of category $\ell$, $\bar{f}$ is the average frequency and L is the number of categories, and, in the right-hand expression, $p_\ell$ and $\bar{p}$ (= 1/L) are the observed and average proportions, and n is the sample size. This index, however, changes with changing sample size, and comparison across items (even with constant sample size) is difficult because the maximum value increases with the number of categories. It can be shown that:

$$X^2 \leq n(L - 1), \tag{40}$$

where equality is reached only where L - 1 categories have frequency zero, i.e., the case of no variability. The index:

$$\Delta = \frac{X^2}{n(L - 1)} \tag{41}$$

is proposed as an index of departure from uniformity. Its minimum is zero (uniform distribution), its maximum is one (no variability). It is invariant under permutation of the categories and is independent of the sample size. This means that it does not change when all frequencies are multiplied by a positive constant. Using proportions $p_\ell$ instead of frequencies $f_\ell$, (41) can be written as:

$$\Delta = \frac{L}{L - 1} \sum_\ell \left( p_\ell - \frac{1}{L} \right)^2 . \tag{42}$$

Table 24 gives some distributions and their associated $\Delta$ values for L = 2 and L = 3. (Row frequencies always sum to 100.)
Table 24: Some Examples of Δ

        L = 2                      L = 3
Frequency       Δ         Frequency          Δ
50  50         0.00       50  50   0        0.25
60  40         0.04       40  30  30        0.01
70  30         0.16       50  25  25        0.06
75  25         0.25       50  48   2        0.22
80  20         0.36       70  15  15        0.30
90  10         0.64       70  28   2        0.35
95   5         0.81       80  10  10        0.49
99   1         0.96       80  18   2        0.51
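The Δ values in Table 24 can be checked directly from (42); a minimal sketch (the function name is mine):

```python
def delta(freqs):
    """Index of departure from uniformity, equations (41)/(42)."""
    n = sum(freqs)
    L = len(freqs)
    return L / (L - 1) * sum((f / n - 1 / L) ** 2 for f in freqs)
```

For example, `delta([60, 40])` reproduces (up to rounding) the 0.04 in the first column of the table, and `delta([80, 18, 2])` the 0.51 in the last row of the second.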
As an example, the $H^*_{ic}$-indices are plotted in Figure 20 against the $\Delta$-values for the 22 open-ended reading items in Booklet 9 of the field trial, using the Australian data. The binary (b) items are distinguished from the items with more than two categories (p). One can see that most of the items are situated in the lower right-hand corner of the figure, combining high homogeneity with a response distribution that does not deviate too far from a uniform distribution. But some items have high homogeneity indices and a very skewed distribution, which may make them less suitable for inclusion in the main study.
A Comparison Between Field Trial and Main Study
The main study used a selection of items from the field trial, but the wording in the item text of the Marking Guides was changed in some cases (mainly due to incorrect translations). The changed items were not tested in an independent field trial, however. If they really did represent an improvement, their homogeneity indices should rise in comparison with the field trial.

Another reason for repeating the homogeneity analysis in the main study is the relative numbers of students and markers used for the reliability study in the field trial and in the main study. Small numbers easily give rise to chance capitalisation, and therefore repeating the homogeneity analyses in the main study serves as a cross-validation.
Figure 21 shows a scatter plot of the $H^*_{ic}$-indices of the reading items in the Australian field trial and main study samples, where items are distinguished by the reading subscale to which they belong: 1 = retrieving information, 2 = interpreting texts and 3 = reflection and evaluation.

As the dashed line in the figure indicates equality, it is immediately clear that the $H^*_{ic}$-indices for the great majority of items have increased in the main study. Another interesting feature is that the reflection items systematically show the least homogeneity among markers.
Variance Component Analysis
The general approach to estimating the variability in the scores due to markers is generalisability theory. Introductions to the approach can be found in Cronbach, Gleser, Nanda and Rajaratnam (1972), and in Brennan (1992). In the present section, a short introduction is given to the general theory and the common estimation methods. A generalisability coefficient is then derived, as a special correlation coefficient, and its interpretation is discussed. Finally, some special PISA-related estimation problems are discussed.
Figure 20: Homogeneity and Departure from Uniformity (horizontal axis: H*; vertical axis: Δ; b = binary items, p = items with more than two categories)
Analysis of two-way tables

To make the notion of generalisability theory clear, a simple case is described first, where a number of students answer a number of items, and for each answer they get a numerical score, which will be denoted as $Y_{vi}$, the subscript v referring to the student, and the subscript i referring to the item. These observations can be arranged in a rectangular $V \times I$ table or matrix, and the main purpose of the analysis is to explain the variability in the table.

Conceptually, the model used to explain the variability in the table is the following:

$$Y_{vi} = \mu + \alpha_v + \beta_i + (\alpha\beta)_{vi} + \varepsilon^*_{vi} , \tag{43}$$

where $\mu$ is an unknown constant, $\alpha_v$ is the student effect, $\beta_i$ is the item effect, $(\alpha\beta)_{vi}$ (to be read as a single symbol, not as a product) is the student-item interaction effect and $\varepsilon^*_{vi}$ is the measurement error. The general approach in generalisability theory is to assume that the students in the sample are randomly drawn from some population, but also that the items are randomly drawn from a population, usually called a universe. This means that the specific value $\alpha_v$ can be considered as a realisation of a random variable, $\alpha$ for example, and similarly for the item effect: $\beta_i$ is a realisation of the random variable $\beta$. Also the interaction effect $(\alpha\beta)_{vi}$ and the measurement error $\varepsilon^*_{vi}$ are considered as realisations of random variables. So the model says that the observed score is a sum of an unknown constant $\mu$ and four random variables.

The model as given in (43), however, is not sufficient to work with, for several reasons. First, since each student in the sample gives only a single response to each item in the test, the interaction effects and the measurement error are confounded. This means that there is no possibility to disentangle interaction and measurement error. Therefore, they will be taken together as a single random variable, $\varepsilon$ for example, which is called the residual (and which is definitely not the same as the measurement error). This is defined as:

$$\varepsilon_{vi} = (\alpha\beta)_{vi} + \varepsilon^*_{vi} , \tag{44}$$

and (43) can be rewritten as:

$$Y_{vi} = \mu + \alpha_v + \beta_i + \varepsilon_{vi} . \tag{45}$$
Figure 21: Comparison of Homogeneity Indices for Reading Items in the Main Study and the Field Trial in Australia (horizontal axis: Field Trial; vertical axis: Main Study)
Second, since the right-hand side of (45) is a sum of four terms, and only this sum is observed, the terms themselves are not identified, and therefore three identification restrictions have to be imposed. Suitable restrictions are:

$$E(\alpha) = E(\beta) = E(\varepsilon) = 0 . \tag{46}$$

Third, apart from the preceding restriction, which is of a technical nature, there is one important theoretical assumption: all the non-observed random variables (student effects, item effects and residuals) are mutually independent. This assumption leads directly to a rather simple variance decomposition:

$$\sigma^2_Y = \sigma^2_\alpha + \sigma^2_\beta + \sigma^2_\varepsilon . \tag{47}$$

The total variance $\sigma^2_Y$ can easily be estimated from the observed data, as well as the constant $\mu$. The first purpose of variance component analysis is to obtain an estimate of the three variance components $\sigma^2_\alpha$, $\sigma^2_\beta$ and $\sigma^2_\varepsilon$. If the data matrix is complete, good estimators are given by the traditional techniques of variance analysis, using the decomposition of the total sum of squares $SS_{tot}$ as:

$$SS_{tot} = SS_{row} + SS_{col} + SS_{res} , \tag{48}$$

where row refers to the students and col refers to the items. Dividing each SS by its respective number of degrees of freedom yields the corresponding so-called mean squares, from which unbiased estimates of the three unknown variance components can be derived:

$$\hat{\sigma}^2_\varepsilon = MS_{res} , \tag{49}$$

$$\hat{\sigma}^2_\alpha = \frac{MS_{row} - MS_{res}}{I} , \tag{50}$$

and

$$\hat{\sigma}^2_\beta = \frac{MS_{col} - MS_{res}}{V} . \tag{51}$$

Usually the exact values of the three variance components will be of little use, but their relative contributions to the total variance are. Therefore the variance components will be expressed as a percentage of the total variance (the sum of the components) in what follows.
The estimators given by (49) through (51) have the attractive property that they are unbiased, but they also have an unattractive property: the results of the formulae in (50) and (51) can be negative. In practice, this seldom occurs, and if it does, it is common to change the negative estimates to zero.
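For a complete table, the estimators (49)-(51) are easy to verify by simulation. The sketch below generates data from model (45) with known components (all sizes and values are invented for illustration) and recovers them from the mean squares:

```python
import numpy as np

rng = np.random.default_rng(0)
V, I = 200, 30                            # students, items
var_a, var_b, var_e = 1.0, 0.25, 0.64     # true sigma^2_alpha, _beta, _epsilon

Y = (rng.normal(0, 1.0, (V, 1))           # alpha_v
     + rng.normal(0, 0.5, (1, I))         # beta_i
     + rng.normal(0, 0.8, (V, I)))        # residual epsilon_vi

gm = Y.mean()
row = Y.mean(axis=1)                      # student means
col = Y.mean(axis=0)                      # item means
MS_row = I * ((row - gm) ** 2).sum() / (V - 1)
MS_col = V * ((col - gm) ** 2).sum() / (I - 1)
SS_res = ((Y - row[:, None] - col[None, :] + gm) ** 2).sum()
MS_res = SS_res / ((V - 1) * (I - 1))

est_e = MS_res                            # equation (49)
est_a = (MS_row - MS_res) / I             # equation (50)
est_b = (MS_col - MS_res) / V             # equation (51)
```

With small true components, the subtractions in (50) and (51) can indeed turn out negative, which is the unattractive property noted above.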
Analysis of three-way tables

The approach for three-way tables is a straightforward generalisation of the case of two-way tables. The observed data are now represented by $Y_{vim}$, the score student v gets for his/her answer on item i when marked by marker m. The observed data are arranged in a three-dimensional array (a box), where the student dimension will be denoted as rows, the item dimension as columns and the marker dimension as layers.

The model is a generalisation of model (45):

$$Y_{vim} = \mu + \alpha_v + \beta_i + \gamma_m + (\alpha\beta)_{vi} + (\alpha\gamma)_{vm} + (\beta\gamma)_{im} + \varepsilon_{vim} . \tag{52}$$

The observed variable is the sum of a constant, three main effects, three first-order interactions and a residual. The residual in this case is the sum of the second-order interaction $(\alpha\beta\gamma)_{vim}$ and the measurement error $\varepsilon^*_{vim}$. Both effects are confounded because there is only one observation in each cell of the three-dimensional data array. The same restrictions as in the case of a two-way table apply: zero mean of the effects and mutual independence. Therefore the total variance decomposes into seven components:

$$\sigma^2_Y = \sigma^2_\alpha + \sigma^2_\beta + \sigma^2_\gamma + \sigma^2_{\alpha\beta} + \sigma^2_{\alpha\gamma} + \sigma^2_{\beta\gamma} + \sigma^2_\varepsilon , \tag{53}$$

and each component can be estimated with techniques similar to those demonstrated in the case of a two-way table. (The formulae are not displayed.)
The risk of ending up with negative estimates is usually greater for a three-dimensional table than for a two-way table. This will be illustrated by one case. The main effect $\gamma_m$ reflects the relative leniency of marker m: marker m is relatively mild if the effect is positive, relatively strict if it is negative. It is not too unrealistic to assume that markers differ systematically in mildness, meaning that the variance component $\sigma^2_\gamma$
will differ markedly from zero, and consequently that its estimator will have a very small probability of yielding a negative estimate. A positive interaction effect $(\alpha\gamma)_{vm}$ means that marker m is especially mild for student v (more than on average towards the other students), reflecting in some way a positive or negative bias towards some students. But if the marking procedure has been seriously implemented (students not being known to the markers), it is to be expected that these effects will be very small if they exist at all. And this means that the corresponding variance component will be close to zero, making a negative estimate quite likely.
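For a complete (balanced) data box, the seven components of (53) can be estimated from the three-way ANOVA mean squares. The text does not display the formulae; the sketch below uses the standard expected-mean-square relations for the fully random model with one observation per cell, with invented sizes and true values:

```python
import numpy as np

rng = np.random.default_rng(1)
V, I, M = 100, 8, 5                    # students, items, markers
true = dict(a=1.0, b=0.6, g=0.3, ab=0.5, ag=0.2, bg=0.1, e=0.8)

Y = (rng.normal(0, np.sqrt(true['a']), (V, 1, 1))
     + rng.normal(0, np.sqrt(true['b']), (1, I, 1))
     + rng.normal(0, np.sqrt(true['g']), (1, 1, M))
     + rng.normal(0, np.sqrt(true['ab']), (V, I, 1))
     + rng.normal(0, np.sqrt(true['ag']), (V, 1, M))
     + rng.normal(0, np.sqrt(true['bg']), (1, I, M))
     + rng.normal(0, np.sqrt(true['e']), (V, I, M)))

gm = Y.mean()
mA, mB, mG = Y.mean((1, 2)), Y.mean((0, 2)), Y.mean((0, 1))
mAB, mAG, mBG = Y.mean(2), Y.mean(1), Y.mean(0)

MS_A = I * M * ((mA - gm) ** 2).sum() / (V - 1)
MS_B = V * M * ((mB - gm) ** 2).sum() / (I - 1)
MS_G = V * I * ((mG - gm) ** 2).sum() / (M - 1)
MS_AB = M * ((mAB - mA[:, None] - mB[None, :] + gm) ** 2).sum() / ((V - 1) * (I - 1))
MS_AG = I * ((mAG - mA[:, None] - mG[None, :] + gm) ** 2).sum() / ((V - 1) * (M - 1))
MS_BG = V * ((mBG - mB[:, None] - mG[None, :] + gm) ** 2).sum() / ((I - 1) * (M - 1))
res = (Y - mAB[:, :, None] - mAG[:, None, :] - mBG[None, :, :]
       + mA[:, None, None] + mB[None, :, None] + mG[None, None, :] - gm)
MS_E = (res ** 2).sum() / ((V - 1) * (I - 1) * (M - 1))

est = dict(                            # moment estimates of the 7 components
    e=MS_E,
    ab=(MS_AB - MS_E) / M,
    ag=(MS_AG - MS_E) / I,
    bg=(MS_BG - MS_E) / V,
    a=(MS_A - MS_AB - MS_AG + MS_E) / (I * M),
    b=(MS_B - MS_AB - MS_BG + MS_E) / (V * M),
    g=(MS_G - MS_AG - MS_BG + MS_E) / (V * I),
)
```

Setting a true component such as `ag` close to zero and rerunning makes the corresponding estimate go negative in a sizeable fraction of replications, which is the phenomenon the text describes.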
Correlations

If one wants to determine the reliability of a test, one of the standard procedures is to administer the test a second time (under identical circumstances), and to compute the correlation between the two sets of test scores. This correlation is the reliability of the test (by definition). But if all variance components are known, this correlation can be computed from them. This is illustrated here for the case of a two-way table. To make the derivations easy, the relative test scores are used, defined as:

$$Y_{v.} = \frac{1}{I} \sum_i Y_{vi} . \tag{54}$$

Using model (43) and substituting in (54):

$$Y_{v.} = \mu + \alpha_v + \frac{1}{I} \sum_i \beta_i + \frac{1}{I} \sum_i (\alpha\beta)_{vi} + \frac{1}{I} \sum_i \varepsilon^*_{vi} . \tag{55}$$

If the test is administered a second time using the same students and the same items, the relative scores on the repeated test will of course have the same structure as the right-hand side of (55), and, moreover, all terms will be identical with the exception of the last one, because the test replication is assumed to be independent.

To compute the covariance between the two sets of test scores, observe that the mean item effect is in both cases the same for all students, meaning that this average item effect does not contribute to the covariance or to the variance. And because the measurement error is independent in the two administrations, it does not contribute to the covariance. So the covariance between the two series of test scores is:

$$Cov(Y_{v.}, Y'_{v.}) = \sigma^2_\alpha + \frac{\sigma^2_{\alpha\beta}}{I} . \tag{56}$$

The variance of the test scores is equal in the two administrations:

$$Var(Y_{v.}) = Var(Y'_{v.}) = \sigma^2_\alpha + \frac{\sigma^2_{\alpha\beta} + \sigma^2_{\varepsilon^*}}{I} . \tag{57}$$

Dividing the right-hand sides of (56) and (57) gives the correlation:

$$\rho_1(Y_{v.}, Y'_{v.}) = \frac{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta}}{I}}{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta} + \sigma^2_{\varepsilon^*}}{I}} . \tag{58}$$

But this correlation cannot be computed from a two-way table, because the interaction component $\sigma^2_{\alpha\beta}$ cannot be estimated. It is common to drop this term from the numerator in (58), giving as a result Cronbach's alpha coefficient (Sirotnik, 1970):

$$\text{alpha} = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta} + \sigma^2_{\varepsilon^*}}{I}} = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \dfrac{\sigma^2_\varepsilon}{I}} \leq \rho_1(Y_{v.}, Y'_{v.}) , \tag{59}$$

where the equality holds if and only if the interaction component $\sigma^2_{\alpha\beta}$ is zero.

Another interesting question concerns the correlation one would find if on the two test occasions an independent set of I items (both randomly drawn from the same universe) were presented to the same students. In this case, the mean item effect will contribute to the variance of the test scores, but not to the covariance, and the correlation will be given by:

$$\rho_2(Y_{v.}, Y'_{v.}) = \frac{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta}}{I}}{\sigma^2_\alpha + \dfrac{\sigma^2_\beta + \sigma^2_{\alpha\beta} + \sigma^2_{\varepsilon^*}}{I}} , \tag{60}$$

which of course cannot be computed either because interaction and measurement error are confounded. Dropping the interaction component from the numerator gives:

$$\rho_2(Y_{v.}, Y'_{v.}) \geq \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \dfrac{\sigma^2_\beta + \sigma^2_\varepsilon}{I}} . \tag{61}$$
In generalisability theory, Cronbach's alpha is also called a generalisability coefficient for relative decisions, while the right-hand side of (61) is called the generalisability coefficient for absolute decisions.
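Expressed in terms of the estimable components, the two coefficients are simple ratios. A sketch (the function names are mine, not standard API names):

```python
def coef_relative(var_a, var_e, I):
    # Cronbach's alpha, equation (59): var_alpha / (var_alpha + var_eps / I)
    return var_a / (var_a + var_e / I)

def coef_absolute(var_a, var_b, var_e, I):
    # right-hand side of (61): the item component enters the denominator too
    return var_a / (var_a + (var_b + var_e) / I)
```

Lengthening the test (larger I) drives both coefficients towards one, and the absolute coefficient can never exceed the relative one.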
Now the more interesting question is addressed of constructing a sensible correlation coefficient for the problem in the PISA study, where there are different sources of variation: three main effects, three first-order interactions and a residual effect. The basic problem is to determine the effects associated with the markers, or in other words what would be the correlation between two series of test scores, observed with the same students and the same items but at each replication marked by an independent set of M markers, randomly drawn from the universe of markers. The relative test score is now defined as:

$$Y_{v..} = \frac{1}{I \times M} \sum_i \sum_m Y_{vim} . \tag{62}$$

Of the eight terms on the right-hand side of (52), some are to be treated as constant, some contribute to the score variance and some to the covariance and the variance, as displayed in Table 25.

Table 25: Contribution of Model Terms to (Co)Variance

constant:                   $\mu$, $\beta$
variance and covariance:    $\alpha$, $(\alpha\beta)$
variance:                   $\gamma$, $(\alpha\gamma)$, $(\beta\gamma)$, $\varepsilon$

Using this table, and taking the definition of the relative score (62) into account, it is not too difficult to derive the correlation between the two series of test scores:

$$\rho_3(Y_{v..}, Y'_{v..}) = \frac{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta}}{I}}{\sigma^2_\alpha + \dfrac{\sigma^2_{\alpha\beta}}{I} + \dfrac{\sigma^2_\gamma + \sigma^2_{\alpha\gamma}}{M} + \dfrac{\sigma^2_{\beta\gamma} + \sigma^2_\varepsilon}{I \times M}} . \tag{63}$$

The expression on the right-hand side can be used as a generic formula for computing the correlation with an arbitrary number of items and an arbitrary number of markers.
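Equation (63) can be wrapped in a small helper to explore how the correlation responds to the number of markers (the function name and example values are mine, chosen for illustration):

```python
def rho3(var_a, var_ab, var_g, var_ag, var_bg, var_e, I, M):
    """Correlation (63) between scores marked by two independent marker sets."""
    num = var_a + var_ab / I
    return num / (num + (var_g + var_ag) / M + (var_bg + var_e) / (I * M))

# even with all marker-related components zero, extra markers still help a
# little, because the residual is averaged over I * M cells
base = rho3(1.0, 0.5, 0.0, 0.0, 0.0, 1.0, I=10, M=1)
more = rho3(1.0, 0.5, 0.0, 0.0, 0.0, 1.0, I=10, M=4)
```

Adding non-zero marker components with the same I and M lowers the correlation, since every marker term enlarges the denominator.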
Estimation with missing data and incomplete designs
The estimation of the variance components with complete data arrays is an easy task to carry out. With incomplete designs and missing data, however, the estimation procedure becomes rather complicated, and good software to carry out the estimations is scarce. The only commercial package known to the author that can handle such problems is BMDP (1992), but the number of cases it can handle is very limited, and it was found to be inadequate for the PISA analyses.
The next section proceeds in three steps. First, the precise structure of the data that were available for analysis is discussed; second, parameter estimation in cases where some observations are missing at random is addressed; and finally, the estimation procedure in incomplete designs is described.
The structure of the data

In the data collection of the main study, items were included in three domains of performance: mathematics, science and reading. For reporting purposes, however, it was decided that three sub-domains in reading would be distinguished: retrieving information, interpreting texts and reflection and evaluation. In the present section these three sub-domains are treated separately, yielding five domains in total.
Data were collected in an incomplete design, using nine different test booklets. One booklet contained only reading items, while the others had a mixed content (see Chapter 2). The multiple-marking scheme was designed to have four markers per booklet-domain combination. For example, four markers did the multiple marking on all open-ended reading items in Booklet 1, and four markers completed the same for Booklet 2, but there were no restrictions on the set of markers for different booklets. Therefore, it could happen that in some countries the set of reading markers was the same for all booklets, while in other countries different sets, with some or no overlap, were used for the marking in the three domains. During the marking process no distinction was made as to the three sub-domains of reading: a marker processed all open-ended reading items appearing in a booklet.1
1 See Chapter 6 for a full description of the design.
This marking scheme is not optimal for a variance component analysis, but on top of that there were two other complications:
• If the marking instructions were followed, we would have a three-dimensional array per booklet-domain combination, completely filled with scores, and such an array was the first unit of analysis in estimating the variance components. In many instances, however, scores were not available in some cells of these arrays, causing serious estimation problems. The way these problems were tackled is explained in the discussion of the second step: estimation with missing observations.
• In some countries, the instruction to use four markers per booklet, thus guaranteeing a complete data array per booklet-domain combination, was not followed rigorously. In some instances more markers were used, each of them doing part of the work in a rather uncontrolled way. This caused serious problems in the estimation procedure of the variance components, as explained in the discussion of the third step: estimation in incomplete designs.
Estimation with missing observations

To demonstrate the problems in the estimation of variance components, the analysis of a two-way table is used as an example, starting with the definition of an indicator variable $u_{vi}$:

$$u_{vi} = \begin{cases} 1 & \text{if } Y_{vi} \text{ is observed} \\ 0 & \text{otherwise,} \end{cases} \tag{64}$$

and defining the number of observations per student v and per item i as:

$$t_v = \sum_i u_{vi} \tag{65}$$

$$n_i = \sum_v u_{vi} , \tag{66}$$

and the total number of observations as:

$$N = \sum_v t_v = \sum_i n_i . \tag{67}$$

For averages, the dot notation will be used:

$$Y_{v.} = \frac{1}{t_v} \sum_i u_{vi} Y_{vi} , \tag{68}$$

$$Y_{.i} = \frac{1}{n_i} \sum_v u_{vi} Y_{vi} \tag{69}$$

and

$$Y_{..} = \frac{1}{N} \sum_v \sum_i u_{vi} Y_{vi} = \frac{1}{N} \sum_v t_v Y_{v.} = \frac{1}{N} \sum_i n_i Y_{.i} . \tag{70}$$

All averages are weighted averages, and since the weights at the single observation level are either one or zero, the value of $Y_{vi}$ is arbitrary if $u_{vi}$ is zero.

Of course the following can always be written:

$$Y_{vi} - Y_{..} = (Y_{v.} - Y_{..}) + (Y_{.i} - Y_{..}) + (Y_{vi} - Y_{v.} - Y_{.i} + Y_{..}) , \tag{71}$$

and for the left-hand expression and each of the expressions between parentheses on the right-hand side of (71), one can define the weighted sum of squares as:

$$SS_{tot} = \sum_v \sum_i u_{vi} (Y_{vi} - Y_{..})^2 , \tag{72}$$

$$SS_{row} = \sum_v t_v (Y_{v.} - Y_{..})^2 , \tag{73}$$

$$SS_{col} = \sum_i n_i (Y_{.i} - Y_{..})^2 , \tag{74}$$

and

$$SS_{res} = \sum_v \sum_i u_{vi} (Y_{vi} - Y_{v.} - Y_{.i} + Y_{..})^2 . \tag{75}$$

But a very nice property, which holds in the case of complete data, is lost in general. It does not hold in general that:

$$SS_{tot} = SS_{row} + SS_{col} + SS_{res} .$$

The estimation procedure in common analysis of variance proceeds by equating the observed sums of squares (or mean squares) to their expected values; so the estimation procedure is a moment estimation. Denote with p one of the elements tot, row, col or res. With complete data, it turns out that:

$$E(SS_p) = A_p \sigma^2_\alpha + B_p \sigma^2_\beta + R_p \sigma^2_\varepsilon , \tag{76}$$
where the coefficients $A_p$, $B_p$ and $R_p$ are very simple functions of the number of students and the number of items. The expected value operator also has a straightforward meaning, since the sampled students and items are considered as a random sample from their respective universes.

With missing data, the situation is more complicated. The indicator variables $u_{vi}$ are to be considered as random variables, and in order to define expected values of sums of squares, difficult problems have to be solved, as can easily be seen from equation (72), where products of u- and Y-variables appear, so that assumptions about the joint distribution of both kinds of variables have to be made. But the assumptions in classical variance component theory about the Y-variables are very weak. In fact, the only assumption is that row effects, column effects and residuals have a finite variance. Without adding very specific assumptions about the Y-variables, the only workable assumption about the joint distribution of Y and u is that they are independent, or that the missing observations are missing completely at random. All analyses carried out on the multiple-marking data of the main study of PISA make this assumption. The consequences of the violation of these assumptions are outlined below.

With this assumption, the expected value operator in (76) is well defined and the general form of (76) is also valid with missing data, although the coefficients $A_p$, $B_p$ and $R_p$ are more complicated than in the complete case. The case of a two-way analysis is given in Table 26.

In the complete case, it holds that $t_v = I$ and $n_i = V$, such that the coefficients reduce to a much simpler form, as displayed in Table 27. In Table 27, it is easily checked that the sum of the last three rows yields the coefficients for $E(SS_{tot})$. This means that any three of the four observed sums of squares can be used to solve for the three unknown variance components, and each system will lead to the same solution. With missing data, however, this feature is lost, and each system of three equations using the coefficients in Table 26 will lead to a different solution, showing that moment estimators are not unique. If the model assumptions are fulfilled, however, the differences between the solutions are reasonably small. (See Verhelst (2000) for more details.)
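The lost additivity of the weighted sums of squares (72)-(75) is easy to demonstrate. The sketch below uses simulated scores and a missing-completely-at-random pattern (the sizes, variances and missingness rate are invented for illustration):

```python
import numpy as np

def weighted_ss(Y, u):
    """Weighted sums of squares (72)-(75) for a two-way table,
    with u the 0/1 indicator of observed cells, equation (64)."""
    t = u.sum(axis=1)                    # observations per student, (65)
    n = u.sum(axis=0)                    # observations per item, (66)
    N = u.sum()                          # total observations, (67)
    Yv = (u * Y).sum(axis=1) / t         # weighted student means, (68)
    Yi = (u * Y).sum(axis=0) / n         # weighted item means, (69)
    Yb = (u * Y).sum() / N               # weighted grand mean, (70)
    SS_tot = (u * (Y - Yb) ** 2).sum()
    SS_row = (t * (Yv - Yb) ** 2).sum()
    SS_col = (n * (Yi - Yb) ** 2).sum()
    SS_res = (u * (Y - Yv[:, None] - Yi[None, :] + Yb) ** 2).sum()
    return SS_tot, SS_row, SS_col, SS_res

rng = np.random.default_rng(2)
V, I = 50, 10
Y = rng.normal(0, 1, (V, 1)) + rng.normal(0, 1, (1, I)) + rng.normal(0, 1, (V, I))

# complete data: SS_tot = SS_row + SS_col + SS_res holds exactly
tot, row, col, res = weighted_ss(Y, np.ones((V, I)))

# with roughly 20% of the cells missing completely at random, it does not
u = (rng.random((V, I)) < 0.8).astype(float)
tot_m, row_m, col_m, res_m = weighted_ss(Y, u)
```

This is exactly why, with missing data, different subsets of the moment equations lead to different component estimates.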
Table 26: Coefficients of the Three Variance Components (General Case)

                 Student Component ($\sigma^2_\alpha$)    Item Component ($\sigma^2_\beta$)    Measurement Error Component ($\sigma^2_\varepsilon$)
$E(SS_{tot})$    $N - \sum_v t_v^2/N$                     $N - \sum_i n_i^2/N$                 $N - 1$
$E(SS_{row})$    $N - \sum_v t_v^2/N$                     $V - \sum_i n_i^2/N$                 $V - 1$
$E(SS_{col})$    $I - \sum_v t_v^2/N$                     $N - \sum_i n_i^2/N$                 $I - 1$
$E(SS_{res})$    $I - \sum_v t_v^2/N$                     $V - \sum_i n_i^2/N$                 $N - V - I - 1 + 2\sum_v \sum_i u_{vi}/(t_v n_i)$
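The Table 26 coefficients can be coded directly from the counts $t_v$, $n_i$ and N; in the complete case they collapse to the Table 27 values, which gives a convenient check. A sketch, with my own function name:

```python
import numpy as np

def coefficients(u):
    """(A_p, B_p, R_p) of Table 26 for each sum of squares, from the 0/1 array u."""
    V, I = u.shape
    t = u.sum(axis=1)                    # t_v, equation (65)
    n = u.sum(axis=0)                    # n_i, equation (66)
    N = u.sum()
    st = (t ** 2).sum() / N              # sum_v t_v^2 / N
    it = (n ** 2).sum() / N              # sum_i n_i^2 / N
    cross = (u / (t[:, None] * n[None, :])).sum()
    return {
        'tot': (N - st, N - it, N - 1),
        'row': (N - st, V - it, V - 1),
        'col': (I - st, N - it, I - 1),
        'res': (I - st, V - it, N - V - I - 1 + 2 * cross),
    }

# with complete data the coefficients reduce to Table 27
c = coefficients(np.ones((6, 4)))
```

For V = 6 and I = 4 (N = 24) this gives (N - I, N - V, N - 1) = (20, 18, 23) for $E(SS_{tot})$ and (0, 0, (V - 1)(I - 1)) = (0, 0, 15) for $E(SS_{res})$; only for complete data do the last three rows sum to the first.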
For three-way tables, a similar reasoning can be followed, yielding coefficients for the expected sums of squares for the total, the three main effects, the three first-order interactions and the residual; and any subset of seven equations can be used for a moment estimator of the seven variance components. The expressions for the components, however, are all more complicated than in the case of two-way tables and are not presented here. For details, see Longford (1995) and Verhelst (2000).
An extensive simulation study (Verhelst, 2000) has shown that the variance components can be consistently estimated if the missing values are indeed missing completely at random, but if they are not, the estimation method may produce totally unacceptable estimates. As an example, the marking procedure for the reading items in Booklet 1 in the United States was examined. This booklet contains 22 open-ended reading items, and multiple markings were collected for 48 students. Following the marking instructions, four markers had to mark all 22 answers for all 48 students, yielding 48 × 22 = 1 056 markings per marker. In the data set, however, nine different marker identifications were found, with the number of markings per marker ranging from 129 to 618. The design with which the markings were collected is unknown, and can only be partially reconstructed from the data. In Table 28, the variance component estimates for the five 'retrieving information' open-ended items in Booklet 1 arising from the above method of analysis are reported. The large negative estimate for the student-marker interaction component is totally unacceptable, and the use of the estimation method in a case such as this is dubious.
Table 28: Estimates of Variance Components (× 100) in the United States

Student Component ($\sigma^2_\alpha$)                             9.8
Item Component ($\sigma^2_\beta$)                                 5.1
Marker Component ($\sigma^2_\gamma$)                              0.4
Student-Item Interaction Component ($\sigma^2_{\alpha\beta}$)     10.0
Student-Marker Interaction Component ($\sigma^2_{\alpha\gamma}$)  -12.0
Item-Marker Interaction Component ($\sigma^2_{\beta\gamma}$)      -0.3
Measurement Error Component ($\sigma^2_\varepsilon$)              13.7
The third step in the procedure is the combination of the variance component estimates for a sub-domain. The analysis per booklet yields estimates per booklet, but the booklet is not interesting as a reporting unit, and so the estimates must be combined in some way across booklets. In the case of a two-way table, the estimation equations are (see equation (76)):

$$SS_{bp} = A_{bp} \sigma^2_\alpha + B_{bp} \sigma^2_\beta + R_{bp} \sigma^2_\varepsilon , \tag{77}$$

where b indexes the booklet and p is any element from tot, row, col or res. If it can be assumed that students, items and markers are randomly assigned to booklets, the summation of (77) across booklets will yield estimation equations for the variance components which yield consistent estimates. Therefore, the following equations can be used:

$$\sum_b SS_{bp} = \sigma^2_\alpha \sum_b A_{bp} + \sigma^2_\beta \sum_b B_{bp} + \sigma^2_\varepsilon \sum_b R_{bp} . \tag{78}$$

For the three-way tables used in the present study, similar estimation equations were used, with the right-hand side of (78) having seven
Table 27: Coefficients of the Three Variance Components (Complete Case)

                 Student Component ($\sigma^2_\alpha$)    Item Component ($\sigma^2_\beta$)    Measurement Error Component ($\sigma^2_\varepsilon$)
$E(SS_{tot})$    $N - I$                                  $N - V$                              $N - 1$
$E(SS_{row})$    $N - I$                                  $0$                                  $V - 1$
$E(SS_{col})$    $0$                                      $N - V$                              $I - 1$
$E(SS_{res})$    $0$                                      $0$                                  $N - V - I + 1$
unknown variance components instead of three. Thevariance components for the mathematics domainare displayed in Table 29 for the Netherlands.
A number of comments are required with respect to this table.
• The table is exemplary for all analyses in the five domains for countries where the marking instructions were rigorously followed. A more complete set of results is given in Chapter 14.
• The most comforting finding is that all variance components where markers are involved (the three columns containing the symbol $\gamma$ as a subscript of the variance component) are negligibly small, meaning that there are no systematic marker effects.
• The results of Booklet 3 are slightly outlying, yielding a much smaller student effect and a much larger item effect than the other booklets. This could be attributed to sampling error (the standard errors of the variance component estimates for main effects are notoriously high; Searle, 1971), but it could also point to a systematic error in the composition of the booklets. Since it concerns three items only, this effect has not been investigated further.
• The most important variance component is the interaction between students and items. If the marker effects all vanish, the remaining components show that the scores are not additive in student effect and item effect. At first sight, this contradicts the use of the Rasch model as the measurement model, but a closer look reveals an essential difference between the Rasch model and the present approach. In IRT models, a transformation to an unbounded
Table 29: Variance Components (%) for Mathematics in the Netherlands

Booklet  Number    Student                Item                  Marker                 Student-Item                    Student-Marker                   Item-Marker                    Measurement Error
         of Items  ($\sigma^2_\alpha$)    ($\sigma^2_\beta$)    ($\sigma^2_\gamma$)    ($\sigma^2_{\alpha\beta}$)      ($\sigma^2_{\alpha\gamma}$)      ($\sigma^2_{\beta\gamma}$)     ($\sigma^2_\varepsilon$)
1        7         37.0                   15.6                  0.0                    41.9                            0.0                              0.0                            5.5
3        3         6.9                    43.2                  0.1                    40.5                            0.6                              0.0                            8.8
5        7         19.3                   34.3                  0.2                    38.3                            -0.2                             0.2                            8.0
8        6         19.3                   27.8                  0.1                    41.8                            -0.1                             0.4                            10.7
9        4         19.3                   27.8                  0.1                    47.8                            -0.1                             0.1                            5.4
All      27        23.6                   26.4                  0.1                    42.1                            0.1                              0.2                            7.5
latent scale is used, whereas in the present approach the scores themselves are modelled, and since the scores are bounded (and highly discrete), differences in difficulty of the items will automatically lead to interaction effects.
• Using the bottom row of Table 29, the generalisability coefficient $\rho_3$ (63) can be computed for different values of I and M. These estimates are displayed, using data from the Netherlands, in Table 30. Extensive tables for all participating countries are given in Chapter 14. Here also, the result is exemplary for all countries, although the indices for reading are generally slightly lower than for mathematics. But the main conclusion is that the correlations are quite high, even with one marker, such that the decision to use a single marker for the open-ended items in the main study seems justified post hoc.
Table 30: Estimates of $\rho_3$ for the Mathematics Scale in the Netherlands

           I = 8            I = 16           I = 24
           M = 1   M = 2    M = 1   M = 2    M = 1   M = 2
$\rho_3$   0.963   0.981    0.977   0.988    0.982   0.997
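A rough illustration of how such a table can be generated from the bottom row of Table 29 follows. The exact form of coefficient (63) is defined earlier in the report and is not reproduced here; the function below uses a standard generalisability-type coefficient for a student × item × marker design, with the student-item interaction counted as part of the true-score variance for a fixed item set. That allocation is an assumption made for this sketch, chosen because it reproduces the order of magnitude of Table 30; it should not be read as the report's own formula.

```python
def rho(I, M, s2_a, s2_ab, s2_ag, s2_bg, s2_e):
    """Reliability of a score averaged over I items and M markers.
    s2_a: student; s2_ab: student-item; s2_ag: student-marker;
    s2_bg: item-marker; s2_e: measurement error (all from Table 29).
    Assumption (for illustration only): student-item interaction is
    treated as true-score variance for a fixed set of I items."""
    true = s2_a + s2_ab / I
    error = s2_ag / M + (s2_bg + s2_e) / (I * M)
    return true / (true + error)

# Bottom row of Table 29 (Netherlands, mathematics), components in %:
nl = dict(s2_a=23.6, s2_ab=42.1, s2_ag=0.1, s2_bg=0.2, s2_e=7.5)
print(round(rho(8, 1, **nl), 3))  # -> 0.964, close to the 0.963 reported in Table 30
```

The coefficient increases with both I and M, which matches the pattern across the columns of Table 30.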
Inter-Country Rater Reliability Study Design
Aletta Grisay
The PISA 2000 quality control procedures included a limited Inter-Country Rater Reliability (ICR) study, conducted in October-November 2000, to
investigate the possibility of systematic bias in marking open-ended reading items. The within-country multiple-marking exercise explored the reliability of the coding completed by the national markers in each country. The objective of the ICR study was to check whether, globally or for particular items, marks given by the different national staffs could be considered equivalent.
A subset of 48 Booklet 7s (that is, the sample included in the multiple-marking exercise for reading within each country) was used for the ICR study. Booklet 7 was chosen because it contained only reading material, and therefore a large number of reading items (26) requiring double marking. All participating countries were requested to submit these 48 student booklets to the Consortium, after obliterating any score given by the national markers.
Staff specially trained by the Consortium for the study and proficient in the various PISA languages then marked the booklets. Their marks were compared to the four marks given by the national markers to the same students' answers.
All cases of clear discrepancy (between the marks given by the independent marker and by all or most of the national markers) were submitted for adjudication to Consortium researchers involved in developing the reading material, along with the actual student's answer (translated for countries using languages other than English or French).
Recruitment of International Markers
The international scoring of booklets from the seven English-speaking PISA countries (with England and Scotland considered as separate countries for this exercise) was done by three experienced Australian markers and one Canadian marker, all selected by the Consortium. The 48 booklets from Australian students were sent to Canada, to be scored by the trainer for PISA reading in English-speaking Canada. The remaining six English-speaking groups (from English-speaking Canada, Ireland, New Zealand, Scotland, the United Kingdom and the United States) were re-scored at ACER by a group of three verifiers selected from the Australian PISA administration. One of these was the trainer for PISA reading in Australia. The other two had demonstrated a high level of reliability during the Australian national operational marking. Each one had sole responsibility for scoring booklets from two countries.
To cover instruction languages used in all other PISA countries, the Consortium appointed 15 of the 25 translators who had already served in the verification of the national translations of PISA material, and were therefore very familiar with it. The selection criteria retained were (i) perfect command of the language(s) of the country (or countries) whose material had to be scored (as far as possible, the responsible persons were chosen from among those verifiers who mastered more than one PISA language, so that each could mark for two or more countries); (ii) perfect command of English and/or French, so that each could score on the basis of either source version of the reading Marking Guide (while checking, when needed, the national version); and (iii) as far as possible, previous experience teaching either the national language or foreign languages at the secondary school level.
Inter-Country Rater Reliability Study Training Sessions
A three-day training session was conducted in Brussels to train the 15 verifiers to use the PISA reading Marking Guide. Consortium experts led the training session. For the three Australian verifiers who had worked on the Australian marking recently (in September-October) and were very familiar with the items, a shorter refresher session was conducted in Melbourne, Australia. No special training or refresher session was deemed necessary for the Canadian verifier.
The session materials included the source English (and French) versions of reading marking instructions for Booklet 7 items (and a copy of the national marking guide for each language other than English, when available), answers to marking queries on Booklet 7 received at ACER, the Reading Workshop material (see Chapter 2), and copies of a benchmark set of English student scripts to be scored independently by all participants to verify their own marking consistency.
This English set of benchmark responses was prepared at ACER as a specific training aid. Most were selected from material sent for the purpose by Ireland, Scotland, the United Kingdom and the United States. Australian booklets and the query records provided other sources. For each item, about 15 responses were chosen to present typical and borderline cases and to illustrate all likely response categories. A score was provided for each response, accompanied in most cases by an explanation or comment. Scores and explanations were in hidden text format.
During the session, the verifiers worked through the material, item by item. Scoring instructions for each item were presented, then the workshop examples were marked and discussed, then the verifiers worked independently on the related benchmark scripts and checked their scores against the recommended scores. Problem responses were discussed with Consortium staff conducting the training session before proceeding to the next item.
At the Brussels session, the verifiers also had the opportunity to start marking part of the material for one of the countries for which they were responsible, under the supervision of the Consortium staff. They were instructed on how to enter their marks in the ICR study software prepared at ACER to help compare their marks with those given by the national markers. Most finished marking at least one of the Booklet 7 clusters for one of their countries. They then received an Excel spreadsheet with the cases where their scores differed from those given by the national markers. They discussed the first few problem cases with the Consortium staff, and received instructions on how to enter their comments and the translation of the student's answer in a column next to the flagged case, so that the Consortium staff could adjudicate the case.
A total of about 15 hours was needed to finish scoring the booklets for each country.
Flag Files
For each country, the ICR study software produced a flag file with about 1 248 cases (48 students × 26 items); numbers varied slightly, since a few countries submitted one or two additional booklets, could not retrieve some of the booklets requested, or submitted booklets with a missing page.
In the file, an asterisk indicated cases where the verifier's score differed significantly from the four scores given by the national markers (e.g., 0 v. 1, respectively, or vice versa; or all national markers giving various partial scores while the verifier gave full or no credit).
Cases with minor discrepancies only (e.g., where the verifier agreed with three out of four of the national markers) were not flagged, nor were cases with national scores too inconsistent to be compared to the verifier's score (e.g., where national marks were 0011 and the verifier's mark was 1, or where national marks were 0123 and the verifier's mark was 2). It was considered that these cases would be identified in the within-country reliability analysis but would be of less interest for the ICR study than cases with a clearer orientation towards leniency or harshness. In each file, a blank column was left for Consortium staff to adjudicate flagged cases.
Adjudication
In the adjudication stage, Consortium staff checked all flagged cases to verify whether the marks were correct, and whether both the national markers and the verifier had used wrong codes in these problematic cases.
For the English-speaking countries, the Consortium's adjudicators checked the flagged cases by referring directly to students' answers in the booklets. Responses flagged for adjudication were marked blind by Consortium experts (i.e., the adjudicator saw none of the previous five marks). Subsequently, that mark was checked against the original four national marks and the verifier's mark. If the adjudicator's mark differed from more than two of the original four national marks, the response was submitted for a further blind marking by the other adjudicator.
For the non-English-speaking countries, the procedure was different, since the flagged students' answers usually had to be translated by the verifier before they could be adjudicated. To reduce the costs of the exercise, verifiers were instructed (i) to simply copy but not translate the student's answer next to the flagged case for those languages known by the adjudicators (i.e., French, German, Italian, Dutch and Spanish), and (ii) to simply add a brief comment, without translating the answer, for cases where they clearly recognised that they had entered an incorrect code. All other flagged cases were translated into English or French. Almost all files for the non-English-speaking countries were adjudicated twice. No blind procedure could be used. Each adjudicator entered comments in the file to explain the rationale for their decision, so that a final consensus could be reached.
The results of this analysis are given in Chapter 14.
Data Cleaning Procedures
Christian Monseur
This chapter presents the data cleaning steps implemented during the main survey of PISA 2000.
Data Cleaning at the National Centre
National Project Managers (NPMs) were required to submit their national data in KeyQuest® 2000, the generic data entry package developed by Consortium staff and pre-configured to include the data entry forms, referred to later as instruments: the achievement test booklets one to nine (together making up the cognitive data); the SE booklet (also cognitive, known as booklet 0); multiple-marking sheets; School Questionnaire and Student Questionnaire instruments with and without the two international options (Computer Familiarity or Information Technology (IT) questionnaire and Cross-Curriculum Competency (CCC) questionnaire); the List of Schools; and the Student Tracking Forms.
The data were verified at several points (integrity check) from the time of data entry. Validation rules (or range checks) were specified for each variable defined in KeyQuest®, and a variable datum was only accepted if it satisfied pre-specified validation rules.1 To prevent duplicate records, a set of variables assigned to an instrument were identified as primary keys. For the student test booklets, stratum, school and student identifications were the primary keys.
Countries were requested to enter data into the Student Tracking Form module before starting to enter data for cognitive tests or context questionnaires. This module, or instrument, contained complete student identification as it should have appeared on the booklet and questionnaire that the student received at the testing session. When configured, KeyQuest® instruments designed for student data were linked with the Student Tracking Form so that warning messages appeared when data operators tried to enter invalid student identifiers or student identifiers that did not match a record in the form.
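The combination of range checks, primary keys and tracking-form linkage can be illustrated generically. KeyQuest® is a closed package, so the sketch below is not its implementation; the field names, the example ranges and the function are all invented for illustration.

```python
# Illustrative range checks per variable (invented values, not PISA's rules).
VALID_RANGES = {"booklet": range(0, 10), "grade": range(7, 13)}

def accept(record, tracking_keys, seen_keys):
    """Accept a data-entry record only if (i) its (stratum, school, student)
    key matches a Student Tracking Form entry, (ii) that primary key has not
    been entered before, and (iii) every checked field passes its range check."""
    key = (record["stratum"], record["school"], record["student"])
    if key not in tracking_keys:
        return False  # no matching Student Tracking Form record: warn operator
    if key in seen_keys:
        return False  # duplicate primary key
    if any(record[v] not in r for v, r in VALID_RANGES.items() if v in record):
        return False  # range-check (validation rule) failure
    seen_keys.add(key)
    return True
```

In the real package these checks ran interactively at entry time; the point here is only the order of the integrity checks, not the interface.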
After the data entry process was completed, NPMs were required to implement some checking procedures using KeyQuest® before submitting data to the Consortium, and to rectify any integrity errors. These included inconsistencies between: the List of Schools and the School Questionnaire; the Student Tracking Form and achievement test booklets; the Student Tracking Form and the Student Questionnaire; the achievement test booklets and the Student Questionnaire; and, in the multiple-marking reliability data, deviations from the international design, in order to detect anything other than four duplicate records per student per booklet.
NPMs were also required to submit an Adaptation and Modification Form with their data, describing all changes they had made to variables on the questionnaire, including the addition or deletion of variables or response categories for variables, and changes to the meaning of response categories. NPMs were also required to propose recoding rules where the national data did not match the international data.
1 National centres could modify the configuration of the variables, giving a range of values that was sometimes reduced or extended compared to Consortium expectations.
Chapter 11
Data Cleaning at the International Centre
Data Cleaning Organisation
Data cleaning was a major component of the PISA 2000 Quality Control and Assurance Program. It was of prime importance that the Consortium detected all anomalies and inconsistencies in submitted data and that no errors were introduced during the cleaning and analysis phases. To reach these high quality requirements, the Consortium implemented dual independent processing.
Two data analysts developed the PISA 2000 data cleaning procedures independently. At each step, the procedures were considered complete only when their application to a fictitious database and to the first two PISA databases received from countries produced identical results and files.
The data files submitted by national centres often needed specific data cleaning or recoding procedures, or at least adaptation of standard data cleaning procedures. Two analysts therefore independently cleaned all submitted data files; the cleaning and analysis procedures were run with both SAS® and SPSS®.
Three teams of data analysts produced the national databases. A team leader was nominated within each team and was the only individual to communicate with the national centres. Each team used SAS® and SPSS®.
Data Cleaning Procedures
Because of the potential impact of PISA results and the scrutiny to which the data were likely to be put, it was essential that no dubious records remained in the data files. During cleaning, as many dubious records as possible were identified, and through a process of extensive discussion between each national centre and the Consortium's data processing centre at the Australian Council for Educational Research (ACER), an effort was made to correct and resolve all data issues. When no adequate solution was found, the offending data records were deleted.2
Unresolved inconsistencies in student and school identifications also led to the deletion of records in the database. Unsolved systematic errors for a particular item were replaced by not applicable codes. For instance, if countries reported a mistranslation or misprint in the national version of a cognitive booklet, data for the variables were recoded as not applicable and were not used in the analyses. Finally, errors or inconsistencies for particular students and particular variables were replaced by not applicable codes.
National Adaptations to the Database
When data arrived at the Consortium, the first step was to check the consistency of the database structure with the international database structure. An automated procedure was developed for this purpose. For each instrument it reported deleted variables, added variables and variables for which the validation rules (range checks) had been changed. This report was then compared with the information provided by the NPM in the Adaptation and Modification Form.
Once all inconsistencies were resolved, the submitted data were recoded where necessary to fit the international structure. All additional or modified variables were set aside in a separate file so that countries could use these data for their own purposes, but they were not included in the international database and have not been used in international data processing.
Verifying the Student Tracking Form and the List of Schools
The Student Tracking Form and the List of Schools were central instruments, because they contained the information used in computing weights, exclusions and participation rates. The Student Tracking Form contained all student identifiers, exclusion and participation codes, the booklet number assigned and some demographic data.3 The List of Schools contained, among other variables, the PISA population size, the grade population size and the sample size. These forms had to be submitted electronically.
The data quality in these two forms and their consistency with the booklets and Student Questionnaire data were verified for:
• consistency of exclusion codes and participation codes (for each session) with the data in the test booklets and questionnaires;
• consistency of the sampling information in the List of Schools (i.e., target population size and the sample size) with the Student Tracking Form;
• within-school student samples selected in accordance with required international procedures; and
• consistency of demographic information in the Student Tracking Form (grade, date of birth and gender) with that in the booklets or questionnaires.

2 Record deletion was strenuously avoided as it decreased the participation rate.
3 See Appendix 11 for an example.
Verifying the Reliability Data
Some 100 checking procedures were implemented to check the following components of the multiple-marking design (see Chapter 6):
• number of records in the reliability files;
• number of records in the reliability files and the corresponding booklets;
• marker identification consistency;
• marker design;
• selection of the booklets for multiple marking; and
• extreme inconsistencies in the marks given by different markers (see Chapter 14).4
Verifying the Context Questionnaire Data
The Student and School Questionnaire data underwent further checks after recoding for national adaptations. Invalid or suspicious student and school data were reported to and discussed with countries. Four types of consistency checks were run:
• Non-valid sums: For example, two questions in the School Questionnaire (SCQ04 and SCQ08) each requested the school principal to provide information as a percentage. The sum of the values had to be 100.
• Implausible values: Consistency checks across variables within instruments combined the information from two or more questions to detect suspicious data, such as the number of students and the number of teachers. Outlying ratios were identified and countries were requested to check the correctness of the numbers. These checks included:
  - comparing numbers of students (SCQ02) and numbers of teachers (SCQ14), with ratios outside ±2 standard deviations considered outliers;
  - comparing numbers of computers (SCQ13) and numbers of students (SCQ02), with ratios outside ±2 standard deviations considered outliers;
  - comparing the mother's completed education level in terms of the International Standard Classification of Education (ISCED, OECD, 1999b) categories for ISCED 3A (STQ12) and ISCED 5A or higher (STQ14), since if the mother did not complete ISCED 3A, she could not have completed 5A; and
  - comparing the father's completed education level similarly.
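The two kinds of implausible-value checks can be sketched as follows. This is an illustration of the logic only, not the Consortium's SAS®/SPSS® code; the function names are invented, and the exact outlier rule (e.g. whether the standard deviation was computed nationally or internationally) is not stated in the text.

```python
import statistics

def ratio_outliers(students, teachers):
    """Return indices of schools whose student/teacher ratio lies more than
    2 standard deviations from the mean ratio across schools."""
    ratios = [s / t for s, t in zip(students, teachers)]
    mean, sd = statistics.mean(ratios), statistics.stdev(ratios)
    return [i for i, r in enumerate(ratios) if abs(r - mean) > 2 * sd]

def isced_inconsistent(completed_3a, completed_5a):
    """ISCED ordering check: completing ISCED 5A or higher without having
    completed ISCED 3A is impossible and should be queried."""
    return completed_5a and not completed_3a
```

A school reporting 400 students and 10 teachers among schools with ratios near 10 would be flagged; a record claiming ISCED 5A without ISCED 3A would likewise be returned to the country.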
• Outliers: Numerical answers from the School Questionnaire were standardised, and outlier values (±3 standard deviations) were returned to national centres for checking.5
• Missing data confusion: Possible confusions between valid codes 9, 99, 999, 90, 900 and missing values were encountered during the field trial. Therefore, for all numerical variables, values close to the missing codes were returned to countries for verification.
Preparing Files for Analysis
For each PISA participating country, several files were prepared for use in analysis:
• a processed cognitive file with all student responses to all items for the three domains after coding;
• a raw cognitive file with all student responses to all items for the three domains before coding;
• a student file with all data from the Student Questionnaire and the two international options (CCC and IT);
• a school file with data from the School Questionnaire (one set of responses per school);
• a weighting file with information from the Student Tracking Form and the List of Schools necessary to compute the weights; and
• reliability files: 19 files with recoded student answers and 19 files with scores, pertaining to two domains from each of six booklets, three domains from each of two booklets and one domain from the remaining booklet, were prepared to facilitate the reliability analyses.

4 For example, some markers reported a missing value while others reported non-zero scores.
5 The questions checked in this manner were school size (SCQ02), instructional time (SCQ06), number of computers (SCQ13), number of teachers (SCQ14), teacher professional development (SCQ15), number of hours per week in each of test language, mathematics and science classes (STQ27), class size (STQ28) and school marks (STQ41).
Processed Cognitive File
For a number of items in the PISA test booklets, a student score on the item was determined by combining multiple student responses. Most recoding required combining two answers into one and summarising the information from the complex multiple-choice items.
In the PISA material, some of the open-ended mathematics and science items were coded in two digits, while all other items were coded with a one-digit mark. ConQuest, the software used to scale the cognitive data, requires items of the same length. To minimise the size of the data file, the double-digit items were recoded into one-digit variables using the first digit. All data produced through scoring, combining or recoding have T added to their item's variable label.
For items omitted by students, embedded missing and non-reached missing items were differentiated. All consecutive missing values starting from the end of each cognitive session (that is, from the end of the first hour and from the end of the second hour separately) were replaced by a non-reached code (r), except for the first value of the missing series. Embedded and non-reached missing items were treated differently in the scaling.
Non-reached items for students who were reported to have left the session earlier than expected or arrived later than expected (part participants) were considered not applicable in all analyses.
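The non-reached rule above (recode the trailing run of missing responses within one cognitive hour, but leave its first value as an ordinary embedded missing) can be sketched as follows. The missing code 9 and the function name are placeholders for illustration, not the actual PISA codes.

```python
MISSING = 9  # placeholder missing code, for illustration only

def mark_non_reached(responses, missing=MISSING):
    """Within one cognitive hour, recode all consecutive missing values at
    the end of the response string as non-reached ('r'), except the first
    value of that trailing series, which stays an embedded missing."""
    out = list(responses)
    i = len(out)
    while i > 0 and out[i - 1] == missing:
        i -= 1
    # out[i:] is the trailing run of missing values: keep its first element
    # as embedded missing, recode the rest as non-reached.
    for j in range(i + 1, len(out)):
        out[j] = "r"
    return out
```

For example, responses 1 0 9 1 9 9 9 become 1 0 9 1 9 r r: the embedded 9 in position three is untouched, and only the trailing series changes after its first value.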
The Student Questionnaire
The Student Questionnaire file includes the data collected from the main questionnaire, the computer familiarity international option and the cross-curriculum competency international option. If a country did not participate in the international options, not applicable codes were used for all variables from them.
Six derived variables were computed during the data cleaning process:
• Father's occupation, mother's occupation and the student's expected occupation at age 30, which were each originally coded using ISCO, were transformed into the Ganzeboom, de Graaf and Treiman (1992) International Socio-Economic Index.
• Question STQ41 regarding school marks is provided in three formats: nominal, ordinal and numerical. The nominal option is used if the country provided data collected with question 41b, that is, above the pass mark, at the pass mark and below the pass mark. Data collected through question 41a were coded according to the reporting policy in the countries. Some countries submitted data in a range of 1-5, or 1-7, etc., while others reported student marks on a scale with a maximum score of 20, or 100. These data were recoded in categorical format if fewer than eight categories were provided (1-7) or in percentages if more than seven categories were provided.

The selection of students included in the questionnaire file was based on the same rules as for the cognitive file.
The School Questionnaire
No modifications other than the correction of data errors and the addition of the country four-digit codes were made to the School Questionnaire file. All participating schools, i.e., any school for which at least one PISA-eligible student was assessed, have a record in the international database, regardless of whether or not they returned the School Questionnaire.
The Weighting Files
The weighting files contained the information from the Student Tracking Form and from the List of Schools. In addition, the following variables were computed and included in the weighting files.
• For each of the three domains and for each student, a scalability variable was computed. If all items for one domain were missing, then the student was considered not scalable. These three variables were useful for selecting the sample of students who would contribute to the international item calibration.
• For each student, a participation indicator was computed. A student who participated in the first hour of the testing session6 (and did not arrive late) was considered a participant. A student who only attended the second cognitive session and/or Student Questionnaire session was not considered a participant. This variable was used to compute the student participation rate.
• For each student, a scorable variable was computed. All students who attended at least one cognitive session (that is, either or both hours of the session) were considered scorable. Further, if a student only attended the Student Questionnaire session and provided data for the father's or mother's occupation questions, then the student was also considered scorable. Therefore, an indicator was also computed to determine whether the student answered the father's or mother's occupation questions.

A few countries submitted data with a grade national option. Therefore, two eligibility variables, PISA-eligible and grade-eligible, were also included in the Student Tracking Form. These new variables were also used to select the records that were included in the international database and therefore in the cognitive file and Student Questionnaire file. To be included in the international database, the student had to be both PISA-eligible and scorable. In other words, all PISA students who attended one of the cognitive sessions were included in the international database. Those who attended only the questionnaire session were included if they provided an answer for the father's or the mother's occupation questions.
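The participant, scorable and inclusion rules described above reduce to a few boolean conditions. The sketch below states them explicitly; the function and parameter names are illustrative, not the actual database layout.

```python
def participant(attended_hour1, arrived_late):
    """Participant: attended the first hour of the testing session
    (original or make-up) and did not arrive late."""
    return attended_hour1 and not arrived_late

def scorable(attended_hour1, attended_hour2, attended_questionnaire,
             gave_parent_occupation):
    """Scorable: attended at least one cognitive hour, or attended only the
    Student Questionnaire session but answered the father's or mother's
    occupation questions."""
    if attended_hour1 or attended_hour2:
        return True
    return attended_questionnaire and gave_parent_occupation

def include_in_database(pisa_eligible, is_scorable):
    """Inclusion in the international database requires both conditions."""
    return pisa_eligible and is_scorable
```

Note that a student attending only the second cognitive hour is scorable (and thus can be included) without counting as a participant for the participation rate.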
PISA students reported in the Student Tracking Form as not eligible, no longer at school, excluded for physical, mental, or linguistic reasons, or absent were not included in the international database. Students who refused
to participate in the assessment were also not included in the international database.
All non-PISA students, i.e., students assessed in a few countries for a national or international grade sample option, were excluded from the international database. Countries submitting such data to the Consortium received them separately.
The Reliability Files
One file was created for each domain and test booklet. The data from the reliability booklets were merged with those in the test booklets so that each student selected for the multiple-marking process appears four times in these files.
6 Of either the original session or a make-up session.
Section Four: Quality Indicators and Outcomes

Chapter 12
Sampling Outcomes
Christian Monseur, Keith Rust and Sheila Krawchuk
This chapter reports on PISA sampling outcomes. Details of the sample design are given in Chapter 4.
Table 31 shows the various quality indicators for population coverage and the various pieces of information used to derive them. The following notes explain the meaning of each coverage index and how the data in each column of the table were used.
Indices 1, 2 and 3 are intended to measure PISA population coverage. Indices 4 and 5 are intended to be diagnostic in cases where indices 1, 2 or 3 have unexpected values. Many references are made in this chapter to the various Sampling Forms on which NPMs documented statistics and other information needed in undertaking the sampling. The forms themselves are included in Appendix 9.
Index 1: Coverage of the National Desired Population, calculated by P/(P+E) × 3[c]/3[a].
• The National Desired Population (NDP), defined by Sampling Form 3 response box [a] and denoted here as 3[a], is the population that includes all enrolled 15-year-olds in each country (with the possibility of small levels of exclusions), based on national statistics. However, the final NDP reflected on each country's school sampling frame might have had some school-level exclusions. The value that represents the population of enrolled 15-year-olds minus those in excluded schools is represented by response box [c] on Sampling Form 3 and denoted here as 3[c]. Thus, the term 3[c]/3[a] provides the proportion of the NDP covered in each country based on national statistics.
• The value (P+E) provides the weighted estimate from the student sample of all eligible 15-year-olds in each country, where P is the weighted estimate of eligible non-excluded 15-year-olds and E is the weighted estimate of eligible 15-year-olds that were excluded within schools. Therefore, the term P/(P+E) provides an estimate based on the student sample of the proportion of the eligible 15-year-old population represented by the non-excluded eligible 15-year-olds.
• Thus the result of multiplying these two proportions together (3[c]/3[a] and P/(P+E)) indicates the overall proportion of the NDP covered by the non-excluded portion of the student sample.
Index 2: Coverage of the National Enrolled Population, calculated by P/(P+E) × 3[c]/2[b].
• The National Enrolled Population (NEP), defined by Sampling Form 2 response box [b] and denoted here as 2[b], is the population that includes all enrolled 15-year-olds in each country, based on national statistics. The final NDP, denoted here as 3[c], reflects the 15-year-old population from each country's school sampling frame. This value represents the population of enrolled 15-year-olds less those in excluded schools.
• The value (P+E) provides the weighted estimate from the student sample of all eligible 15-year-olds in each country, where P is the weighted estimate of eligible non-excluded 15-year-olds and E is the weighted estimate of eligible 15-year-olds who were excluded within schools. Therefore, the term P/(P+E) provides an estimate based on the student sample of the proportion of the eligible 15-year-old population that is represented by the non-excluded eligible 15-year-olds.
• Multiplying these two proportions together (3[c]/2[b] and P/(P+E)) gives the overall proportion of the NEP that is covered by the non-excluded portion of the student sample.
Index 3: Coverage of the National 15-Year-Old Population, calculated by P/2[a].
• The National Population of 15-year-olds, defined by Sampling Form 2 response box [a] and denoted here as 2[a], is the entire population of 15-year-olds in each country (enrolled and not enrolled), based on national statistics. The value P is the weighted estimate of eligible non-excluded 15-year-olds from the student sample. Thus P/2[a] indicates the proportion of the National Population of 15-year-olds covered by the non-excluded portion of the student sample.
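As a rough sketch, the first three coverage indices can be computed directly from the sampling-form entries and the weighted sample estimates. The figures below are invented for illustration; they are not taken from Table 31 or from any country's Sampling Forms.

```python
# Invented sampling-form entries and weighted sample estimates (illustrative only).
all_15_year_olds = 70_000        # Sampling Form 2, response box [a]
enrolled_15_year_olds = 68_500   # Sampling Form 2, response box [b]
national_desired_pop = 68_000    # Sampling Form 3, response box [a]
ndp_minus_school_excl = 67_200   # Sampling Form 3, response box [c]

P = 64_000.0  # weighted estimate of eligible, non-excluded 15-year-olds
E = 900.0     # weighted estimate of eligible 15-year-olds excluded within schools

# Index 1: coverage of the National Desired Population, P/(P+E) x 3[c]/3[a]
index_1 = P / (P + E) * ndp_minus_school_excl / national_desired_pop
# Index 2: coverage of the National Enrolled Population, P/(P+E) x 3[c]/2[b]
index_2 = P / (P + E) * ndp_minus_school_excl / enrolled_15_year_olds
# Index 3: coverage of the National 15-Year-Old Population, P/2[a]
index_3 = P / all_15_year_olds

print(f"Index 1: {index_1:.3f}  Index 2: {index_2:.3f}  Index 3: {index_3:.3f}")
```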
Index 4: Coverage of the Estimated School Population, calculated by (P+E)/S.
• The value (P+E) provides the weighted estimate from the student sample of all eligible 15-year-olds in each country, where P is the weighted estimate of eligible non-excluded 15-year-olds and E is the weighted estimate of eligible 15-year-olds who were excluded within schools.
• The value S is an estimate of the 15-year-old school population in each country. This is based on the actual or (more often) approximate number of 15-year-olds enrolled in each school in the sample, prior to contacting the school to conduct the assessment. The S value is calculated as the sum over all sampled schools of the product of each school's sampling weight and its number of 15-year-olds (ENR) as recorded on the school sampling frame. In the infrequent case where the ENR value was not available, the number of 15-year-olds from the Student Tracking Form was used.
• Thus, (P+E)/S is the proportion of the estimated school 15-year-old population that is represented by the weighted estimate from the student sample of all eligible 15-year-olds. Its purpose is to check whether the student sampling has been carried out correctly, and to assess whether the value of S is a reliable measure of the number of enrolled 15-year-olds. This is important for interpreting Index 5.
Index 5: Coverage of the School Sampling Frame Population, calculated by S/3[c].
• The value S/3[c] is the ratio of the enrolled 15-year-old population, as estimated from data on the school sampling frame, to the size of the enrolled student population, as reported on Sampling Form 3. In some cases, this provides a check as to whether the data on the sampling frame give a reliable estimate of the number of 15-year-olds in each school. In other cases, however, it is evident that 3[c] has been derived using data from the sampling frame by the National Project Manager, so that this ratio may be close to 1.0 even if enrolment data on the school sampling frame are poor. Under such circumstances, Index 4 will differ noticeably from 1.0, and the figure for 3[c] will also be inaccurate.

Tables 32, 33 and 34 present school and student-level response rates. Table 32 indicates the rates calculated by using only original schools and no replacement schools. Table 33 indicates the improved response rates when first and second replacement schools were accounted for in the rates. Table 34 indicates the student response rates among the full set of participating schools.

For calculating school response rates before replacement, the numerator consisted of all original sample schools with enrolled age-eligible students who participated (i.e., assessed a sample of eligible students, and obtained a student response rate of at least 50 per cent). The denominator consisted of all the schools in the numerator, plus those original sample schools with enrolled age-eligible students that either did not participate or failed to assess at least 50 per cent of eligible sample students. Schools that were included in the sampling frame, but were found to have no age-eligible students, were omitted from the calculation of response rates. Replacement schools do not figure in these calculations.
Table 31: Sampling and Coverage Rates

[Table 31 lists, for each country, the Sampling Sheet Information (all 15-year-olds, SF 2[a]; enrolled 15-year-olds, SF 2[b]; national desired target, SF 3[a]; school-level exclusions, SF 3[b]; national desired target minus school-level exclusions, SF 3[c]; school-level exclusion rate, 3[b]/3[a]), the Sample Information (enrolled students on the school sampling frame, S; actual and weighted numbers of participants P, excluded students E, ineligible students and eligible students P+E; within-school exclusion rate, E/(P+E); overall exclusion rate) and Coverage Indices 1 to 5. The country-by-country values are not recoverable from the source extraction.]
Notes:
1. The sampling form numbers for Belgium exceed the sum of the two parts because the German-speaking community in Belgium is also included in these numbers.
2. Brazil's eligible population consists of 15-year-olds in grades 7 to 10. The last two coverage indices are not available because S is an overestimate of the non-grade 5 and 6 population (it only excludes schools with only grades 5 and 6).
3. Hungary used grade 9 enrolments for primary schools, which was not a good measure of size. Therefore, the value of S has been recalculated as 122 609.74 (sum of school base weights weighted by MOS for the secondary schools) plus 8 681 (sum of student weights from the primary schools stratum) = 131 290.74.
4. Primary schools in Poland were not randomly sampled and therefore these students could not be used. As a result, the 6.7 per cent of the population accounted for by these students are added to the school-level exclusion figure.
5. The Sampling Form 2 numbers for the United Kingdom exceed the sum of the three parts because Wales is also included in these numbers. This also affects Indices 2 and 3 for the United Kingdom.
6. On Sampling Form 3, the within-school exclusion figure for England was 1 258. After sampling, special education schools were also excluded. They account for another 14 863 for a new total school-level exclusion of 16 121.
7. (a) Students in vocational schools are enrolled on a part-time/part-year basis. (b) Since the PISA assessment was conducted only at one point in time, these students are captured only partially. (c) It is not possible to assess how well the students sampled from vocational schools represent the universe of students enrolled in vocational schools, and so those students not attending classes at the time of the PISA assessment are not represented in the PISA results.
For calculating school response rates after replacement, the numerator consisted of all sample schools (original plus replacement) with enrolled age-eligible students that participated (i.e., assessed a sample of eligible students and obtained a student response rate of at least 50 per cent). The denominator consisted of all the schools in the numerator that were in the original sample, plus those original sample schools that had age-eligible students enrolled, but that failed to assess at least 50 per cent of eligible sample students and for which no replacement school participated. Schools that were included in the sampling frame, but were found to contain no age-eligible students, were omitted from the calculation of response rates. Replacement schools were included only when they participated, and were replacing a refusing school that had age-eligible students.

In calculating weighted school response rates, each school received a weight equal to the product of its base weight (the reciprocal of its selection probability) and the number of age-eligible students enrolled, as indicated on the sampling frame.

With the use of probability proportional-to-size sampling, in countries with few certainty school selections and no over-sampling or under-sampling of any explicit strata, weighted and unweighted rates are very similar.
Thus, the weighted school response rate before replacement is given by the formula:

\[
\text{weighted school response rate before replacement} = \frac{\sum_{i \in Y} W_i E_i}{\sum_{i \in (Y \cup N)} W_i E_i} , \quad (79)
\]
where Y denotes the set of responding original sample schools with age-eligible students, N denotes the set of eligible non-responding original sample schools, W_i denotes the base weight for school i, W_i = 1/P_i, where P_i denotes the school selection probability for school i, and E_i denotes the enrolment size of age-eligible students, as indicated on the sampling frame.
The weighted school response rate, after replacement, is given by the formula:

\[
\text{weighted school response rate after replacement} = \frac{\sum_{i \in (Y \cup R)} W_i E_i}{\sum_{i \in (Y \cup N)} W_i E_i} , \quad (80)
\]
where Y denotes the set of responding original sample schools, R denotes the set of responding replacement schools, for which the corresponding original sample school was eligible but was non-responding, N denotes the set of eligible refusing original sample schools, W_i denotes the base weight for school i, W_i = 1/P_i, where P_i denotes the school selection probability for school i, and for weighted rates, E_i denotes the enrolment size of age-eligible students, as indicated on the sampling frame.
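A minimal sketch of formulas (79) and (80), using a toy school list with invented base weights, enrolments and dispositions (not PISA data):

```python
# Each tuple: (base weight W_i = 1/P_i, age-eligible enrolment E_i, status).
# Status codes: 'Y' = responding original school, 'N' = eligible non-responding
# original school, 'R' = responding replacement for a refusing original school.
schools = [
    (10.0, 120, "Y"),
    (10.0,  80, "Y"),
    (25.0, 200, "N"),  # refused; replaced by the 'R' school below
    (25.0, 150, "R"),
    (40.0,  60, "Y"),
]

def school_response_rate(schools, after_replacement=False):
    numerator_statuses = ("Y", "R") if after_replacement else ("Y",)
    num = sum(w * e for w, e, s in schools if s in numerator_statuses)
    # The denominator sums over the original sample (Y union N) in both cases.
    den = sum(w * e for w, e, s in schools if s in ("Y", "N"))
    return num / den

before = school_response_rate(schools)                         # formula (79)
after = school_response_rate(schools, after_replacement=True)  # formula (80)
```

As in Tables 32 and 33, the after-replacement rate can only improve on the before-replacement rate, since replacement schools add to the numerator but not to the denominator.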
For unweighted student response rates, the numerator is the number of students for whom assessment data were included in the results. The denominator is the number of sampled students who were age-eligible, and not explicitly excluded as student exclusions. The exception is cases where countries applied different sampling rates across explicit strata. In these cases, unweighted rates were calculated in each stratum, and then weighted together according to the relative population size of 15-year-olds in each stratum.

For weighted student response rates, the same number of students appear in the numerator and denominator as for unweighted rates, but each student was weighted by its student base weight. This is given as the product of the school base weight (for the school in which the student is enrolled) and the reciprocal of the student selection probability within the school.

In countries with no over-sampling of any explicit strata, weighted and unweighted student participation rates are very similar.
Overall response rates are calculated as the product of school and student response rates. Although overall weighted and unweighted rates can be calculated, there is little value in presenting overall unweighted rates. The weighted rates indicate the proportion of the student population represented by the sample prior to making the school and student non-response adjustments.
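A minimal sketch of the weighted student response rate and the overall rate, with invented weights and selection probabilities (the school-level rate below is an assumed stand-in, not a tabulated value):

```python
# Each tuple: (school base weight, within-school selection probability, assessed?).
# The student base weight is the school base weight multiplied by the reciprocal
# of the within-school selection probability.
students = [
    (10.0, 0.5, True),
    (10.0, 0.5, True),
    (10.0, 0.5, False),  # sampled and age-eligible, but absent
    (25.0, 0.8, True),
    (25.0, 0.8, False),
]

weights = [(w / p, assessed) for w, p, assessed in students]
weighted_student_rate = (sum(w for w, assessed in weights if assessed)
                         / sum(w for w, _ in weights))

# Overall response rate = school response rate x student response rate.
weighted_school_rate = 0.90  # assumed school-level figure for illustration
overall_rate = weighted_school_rate * weighted_student_rate
```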
Table 32: School Response Rates Before Replacements
Columns: Country; Weighted School Participation Rate Before Replacement (%); Number of Responding Schools (Weighted by Enrolment); Number of Responding and Non-Responding Schools Sampled (Weighted by Enrolment); Number of Responding Schools (Unweighted); Number of Responding and Non-Responding Schools (Unweighted)

Australia 80.95 197 639 244 157 206 247
Austria 99.38 86 062 86 601 212 213
Belgium 69.12 81 453 117 836 171 252
Belgium (Fl.) 61.51 42 506 69 110 91 150
Belgium (Fr.) 79.93 38 947 48 726 80 102
Brazil 97.38 2 425 608 2 490 788 315 324
Canada 87.91 335 100 381 165 1 069 1 159
Czech Republic 95.30 123 345 129 422 217 229
Denmark 83.66 42 027 50 236 194 235
Finland 96.82 63 783 65 875 150 155
France 94.66 704 971 744 754 173 184
Germany 94.71 885 792 935 222 213 220
Greece 83.91 92 824 110 622 98 141
Hungary 98.67 209 153 211 969 193 195
Iceland 99.88 4 015 4 020 130 131
Ireland 85.56 53 164 62 138 132 154
Italy 97.90 550 932 562 763 167 170
Japan 82.05 1 165 576 1 420 533 123 150
Korea 100.00 589 018 589 018 146 146
Latvia 82.39 29 354 35 628 143 174
Liechtenstein 100.00 327 327 11 11
Luxembourg 93.04 3 852 4 140 23 25
Mexico 92.69 985 745 1 063 524 170 182
Netherlands 27.13 49 019 180 697 49 180
New Zealand 77.65 39 328 50 645 136 178
Norway 85.95 43 207 50 271 164 191
Poland 79.11 432 603 546 842 119 150
Portugal 95.27 120 521 126 505 145 152
Russian Federation 98.84 4 445 841 4 498 235 237 242
Spain 95.41 423 900 444 288 177 185
Sweden 99.96 100 534 100 578 159 160
Switzerland 91.81 89 208 97 162 271 311
United Kingdom 61.27 400 737 654 095 292 430
England 58.83 333 674 567 204 106 180
Northern Ireland 70.90 18 525 26 129 100 142
Scotland 79.88 48 537 60 762 86 108
United States 56.42 2 013 101 3 567 961 116 210
Table 33: School Response Rates After Replacements
Columns: Country; Weighted School Participation Rate After Replacement (%); Number of Responding Schools (Weighted by Enrolment); Number of Responding and Non-Responding Schools Sampled (Weighted by Enrolment); Number of Responding Schools (Unweighted); Number of Responding and Non-Responding Schools (Unweighted)

Australia 93.65 228 668 244 175 228 247
Austria 100.00 86 601 86 601 213 213
Belgium 85.52 100 833 117 911 214 252
Belgium (Fl.) 79.75 55 144 69 142 119 150
Belgium (Fr.) 93.68 45 689 48 770 95 102
Brazil 97.96 2 439 152 2 489 942 318 324
Canada 93.31 355 644 381 161 1 098 1 159
Czech Republic 99.01 128 551 129 841 227 229
Denmark 94.86 47 689 50 271 223 235
Finland 100.00 65 875 65 875 155 155
France 95.23 709 454 744 982 174 184
Germany 94.71 885 792 935 222 213 220
Greece 99.77 130 555 130 851 139 141
Hungary 98.67 209 153 211 969 193 195
Iceland 99.88 4 015 4 020 130 131
Ireland 87.53 54 388 62 138 135 154
Italy 100.00 562 755 562 755 170 170
Japan 90.05 1 279 121 1 420 533 135 150
Korea 100.00 589 018 589 018 146 146
Latvia 88.51 31 560 35 656 153 174
Liechtenstein 100.00 327 327 11 11
Luxembourg 93.04 3 852 4 140 23 25
Mexico 100.00 1 063 524 1 063 524 182 182
Netherlands 55.50 100 283 180 697 100 180
New Zealand 86.37 43 744 50 645 152 178
Norway 92.25 46 376 50 271 176 191
Poland 83.21 455 870 547 847 126 150
Portugal 95.27 120 521 126 505 145 152
Russian Federation 99.29 4 466 335 4 498 235 238 242
Spain 100.00 444 288 444 288 185 185
Sweden 99.96 100 534 100 578 159 160
Switzerland 95.84 92 888 96 924 282 311
United Kingdom 82.14 537 219 654 022 349 430
England 82.32 466 896 567 204 148 180
Northern Ireland 79.37 20 755 26 150 113 142
Scotland 81.70 49 568 60 668 88 108
United States 70.33 2 503 666 3 559 661 145 210
Table 34: Student Response Rates After Replacements
Columns: Country; Weighted Student Participation Rate After Second Replacement (%); Number of Students Assessed (Weighted); Number of Students Sampled, Assessed and Absent (Weighted); Number of Students Assessed (Unweighted); Number of Students Sampled, Assessed and Absent (Unweighted)
Australia 84.24 161 607 191 850 5 154 6 173
Austria 91.64 65 562 71 547 4 745 5 164
Belgium 93.30 88 816 95 189 6 648 7 103
Belgium (Fl.) 95.49 46 801 49 012 3 874 4 055
Belgium (Fr.) 90.98 42 014 46 177 2 774 3 048
Brazil 87.15 1 463 000 1 678 789 4 885 5 613
Canada 84.89 276 233 325 386 29 461 33 736
Czech Republic 92.76 115 371 124 372 5 343 5 769
Denmark 91.64 37 171 40 564 4 212 4 592
Finland 92.80 58 303 62 826 4 864 5 237
France 91.19 634 276 695 523 4 657 5 115
Germany 85.65 666 794 778 516 4 983 5 788
Greece 96.83 136 919 141 404 4 672 4 819
Hungary 95.31 100 807 105 769 4 883 5 111
Iceland 87.09 3 372 3 872 3 372 3 872
Ireland 85.59 42 088 49 172 3 786 4 424
Italy 93.08 475 446 510 792 4 984 5 369
Japan 96.34 1 267 367 1 315 462 5 256 5 450
Korea 98.84 572 767 579 470 4 982 5 045
Latvia 90.73 24 403 26 895 3 915 4 305
Liechtenstein 96.62 314 325 314 325
Luxembourg 89.19 3 434 3 850 3 434 3 850
Mexico 93.95 903 100 961 283 4 600 4 882
Netherlands 84.03 72 656 86 462 2 503 2 958
New Zealand 88.23 35 616 40 369 3 667 4 163
Norway 89.28 40 908 45 821 4 147 4 665
Poland 87.70 393 675 448 904 3 639 4 169
Portugal 86.28 82 395 95 493 4 517 5 232
Russian Federation 96.21 1 903 348 1 978 266 6 701 6 981
Spain 91.78 366 301 399 100 6 214 6 764
Sweden 87.96 82 956 94 312 4 416 5 017
Switzerland 95.13 65 677 69 037 6 084 6 389
United Kingdom 80.97 419 713 518 358 9 250 11 300
England 81.07 366 277 451 828 4 099 5 047
Northern Ireland 86.01 17 049 19 821 2 825 3 281
Scotland 77.90 36 387 46 708 2 326 2 972
United States 84.99 1 801 229 2 119 392 3 700 4 320
Design Effect and Effective Sample Size
Surveys in education, and especially international surveys, rarely sample students by simply selecting a random sample of students (a simple random sample). Schools are first selected and, within each selected school, classes or students are randomly sampled. Sometimes, geographic areas are first selected before sampling schools and students. This sampling design is usually referred to as a cluster sample or a multi-stage sample.

Selected students attending the same school cannot be considered as independent observations, as they can be with a simple random sample, because they are usually more similar than students attending distinct educational institutions. For instance, they are offered the same school resources, may have the same teachers and therefore are taught a common implemented curriculum, and so on. School differences are also larger if different educational programs are not available in all schools. One expects to observe greater differences between a vocational school and an academic school than between two comprehensive schools.

Furthermore, it is well known that within a country, within sub-national entities and within a city, people tend to live in areas according to their financial resources. As children usually attend schools close to their house, it is likely that students attending the same school come from similar social and economic backgrounds.
A simple random sample of 4 000 students is thus likely to cover the diversity of the population better than a sample of 100 schools with 40 students observed within each school. It follows that the uncertainty associated with any population parameter estimate (i.e., standard error) will be larger for a clustered sample than for a simple random sample of the same size.
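This comparison can be illustrated with a small Monte Carlo sketch. The population below is simulated (a hypothetical between-school effect plus within-school noise, with a fixed seed); both designs draw 4 000 students, once as 100 schools of 40 students and once as a simple random sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated population: 500 schools of 200 students; a student's score is a
# school effect plus individual noise, so schoolmates are positively correlated.
n_schools, school_size = 500, 200
school_effect = rng.normal(0.0, 30.0, n_schools)
scores = school_effect[:, None] + rng.normal(0.0, 70.0, (n_schools, school_size))

def clustered_sample_mean():
    # 100 schools x 40 students per school
    picked = rng.choice(n_schools, size=100, replace=False)
    return np.mean([rng.choice(scores[s], size=40, replace=False).mean()
                    for s in picked])

def srs_sample_mean():
    # simple random sample of 4 000 students from the whole population
    return rng.choice(scores.ravel(), size=4_000, replace=False).mean()

reps = 200
se_clustered = np.std([clustered_sample_mean() for _ in range(reps)])
se_srs = np.std([srs_sample_mean() for _ in range(reps)])
# With positive between-school variance, se_clustered exceeds se_srs.
```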
To limit this reduction of precision in the population parameter estimate, multi-stage sample designs usually use complementary information to improve coverage of the population diversity. In PISA, and in previous international surveys, the following techniques were implemented to limit the increase in standard errors: (i) explicit and/or implicit stratification of the school sample frame and (ii) selection of the schools with probabilities proportional to their size. Complementary information generally cannot fully compensate for the increase in the standard errors due to the multi-stage design, however.

The reduction in design efficiency is usually reported through the 'design effect' (Kish, 1965) that describes the 'ratio of the variance of the estimate obtained from the (more complex) sample to the variance of the estimate that would have been obtained from a simple random sample of the same number of units. The design effect has two primary uses in sample size estimation and in appraising the efficiency of more complex plans.' (Cochran, 1977).
In PISA, a design effect has been computed for a statistic t using:

\[
Deff_3(t) = \frac{Var_{BRR}(t)}{Var_{SRS}(t)} , \quad (81)
\]
where Var_BRR(t) is the sampling variance for the statistic t computed by the BRR replication method (see Chapter 8), and Var_SRS(t) is the sampling variance for the same statistic t on the same database but considering the sample as a simple random sample.
Another way to express the lack of precision due to the complex sampling design is through the effective sample size, which expresses the simple random sample size that would give the same standard error as the one obtained from the actual complex sample design. In PISA, the effective sample size for statistic t is equal to:

\[
Effn_3(t) = \frac{n}{Deff_3(t)} = \frac{n \times Var_{SRS}(t)}{Var_{BRR}(t)} , \quad (82)
\]
where n is equal to the actual number of units in the sample.
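For illustration, with invented variance figures (not values from Tables 35 to 39), formulas (81) and (82) amount to:

```python
# Illustrative variance estimates for a statistic t (invented numbers).
var_brr = 4.5   # sampling variance from the BRR replication method
var_srs = 1.5   # sampling variance treating the sample as simple random
n = 5_000       # actual number of sampled students

deff_3 = var_brr / var_srs       # (81): design effect
effn_3 = n * var_srs / var_brr   # (82): effective sample size, i.e. n / deff_3
```

Here a sample of 5 000 students carries only as much information about the mean as a simple random sample of about 1 667 students would.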
The notion of design effect as given in (81) is extended and produces five design effects to describe the influence of the sampling and test designs on the standard errors for statistics.

The total errors computed for the international PISA initial report consist of two components: sampling variance and measurement variance. The standard error in PISA is inflated because the students were not sampled according to a simple random sample and also because the measure of the student proficiency estimates includes some amount of random error.
For any statistic t, the population estimate and the sampling variance are computed for each plausible value and then combined as described in Chapter 9.

The five design effects and their respective effective sample sizes that are considered are defined as follows:
\[
Deff_1(t) = \frac{Var_{SRS}(t) + MVar(t)}{Var_{SRS}(t)} , \quad (83)
\]

where MVar(t) is the measurement variance for the statistic t. This design effect shows the inflation of the total variance that would have occurred due to measurement error if in fact the sample were considered a simple random sample.

\[
Deff_2(t) = \frac{Var_{BRR}(t) + MVar(t)}{Var_{SRS}(t) + MVar(t)} \quad (84)
\]

shows the inflation of the total variance due only to the use of the complex sampling design.

\[
Deff_3(t) = \frac{Var_{BRR}(t)}{Var_{SRS}(t)} \quad (85)
\]

shows the inflation of the sampling variance due to the use of the complex design.

\[
Deff_4(t) = \frac{Var_{BRR}(t) + MVar(t)}{Var_{BRR}(t)} \quad (86)
\]

shows the inflation of the total variance due to the measurement error.

\[
Deff_5(t) = \frac{Var_{BRR}(t) + MVar(t)}{Var_{SRS}(t)} \quad (87)
\]

shows the inflation of the total variance due to the measurement error and due to the complex sampling design.
The product of the first and second design effects is equal to the product of the third and fourth design effects, and both products are equal to the fifth design effect.
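The five design effects, and the identity between their products, can be checked numerically with illustrative variance components (invented numbers, not PISA estimates):

```python
# Illustrative variance components for a statistic t (invented numbers).
var_srs = 1.0   # sampling variance under simple random sampling
var_brr = 6.0   # sampling variance under the complex design (BRR)
mvar = 0.5      # measurement variance

deff_1 = (var_srs + mvar) / var_srs            # (83)
deff_2 = (var_brr + mvar) / (var_srs + mvar)   # (84)
deff_3 = var_brr / var_srs                     # (85)
deff_4 = (var_brr + mvar) / var_brr            # (86)
deff_5 = (var_brr + mvar) / var_srs            # (87)

# Deff_1 x Deff_2 = Deff_3 x Deff_4 = Deff_5: both products telescope to
# (Var_BRR + MVar) / Var_SRS, the total-variance inflation.
assert abs(deff_1 * deff_2 - deff_5) < 1e-12
assert abs(deff_3 * deff_4 - deff_5) < 1e-12
```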
Tables 35 to 39 provide the design effects and the effective sample sizes, respectively, for the means of the combined reading, mathematical and scientific literacy scales, the percentage of students at level 3 on the combined reading literacy scale, and the percentage of students at level 5 on the combined reading literacy scale.

The results show that the design effects depend on the computed statistics. It is well known that the design effects due to multi-stage sample designs are usually high for statistics such as means but much lower for relational statistics such as correlation or regression coefficients.

Because the samples for the mathematics and science scales are drawn from the same schools as that for the combined reading scale, but with far fewer students, the reading sample is much more clustered than the science and mathematics samples. It is therefore not surprising to find that design effects are generally substantially higher for reading than for mathematics and science.

The design effect due to the multi-stage sample is generally quite small for the percentage of students at level 3 but generally higher for the percentage of students at level 5. Recoding the student proficiency estimates into being or not being at level 3 suppresses the distinction between high and low achievers and therefore considerably reduces the school variance. Recoding student proficiency estimates into being or not being at level 5, however, keeps the distinction between high and lower achievers and preserves some of the school variance.
The measurement error for the minor domains is not substantially higher than the measurement error for the major domain because the proficiency estimates were generated with a multi-dimensional model using a large set of variables as conditioning variables. This complementary information has effectively reduced the measurement error for the minor domain proficiency estimates.
Table 35: Design Effects and Effective Sample Sizes for the Mean Performance on the Combined Reading Literacy Scale

[Table 35 lists, for each country, Design Effects 1 to 5 and the corresponding Effective Sample Sizes 1 to 5. The country-by-country values are not recoverable from the source extraction.]
Table 36: Design Effects and Effective Sample Sizes for the Mean Performance on the Mathematical Literacy Scale

(For each participating country: Design Effects 1 to 5 and the corresponding Effective Sample Sizes 1 to 5.)
Table 37: Design Effects and Effective Sample Sizes for the Mean Performance on the Scientific Literacy Scale

(For each participating country: Design Effects 1 to 5 and the corresponding Effective Sample Sizes 1 to 5.)
Table 38: Design Effects and Effective Sample Sizes for the Percentage of Students at Level 3 on the Combined Reading Literacy Scale

(For each participating country: Design Effects 1 to 5 and the corresponding Effective Sample Sizes 1 to 5.)
Table 39: Design Effects and Effective Sample Sizes for the Percentage of Students at Level 5 on the Combined Reading Literacy Scale

(For each participating country: Design Effects 1 to 5 and the corresponding Effective Sample Sizes 1 to 5.)
Scaling Outcomes
Ray Adams and Claus Carstensen
International Characteristics of the Item Pool
When main study data were received from each participating country, they were first verified and cleaned using the procedures outlined in Chapter 11. Files containing the achievement data were prepared, and national-level Rasch and traditional test analyses were undertaken. The results of these analyses were included in the reports that were returned to each participant.
After processing at the national level, a set of international-level analyses was undertaken. Some involved summarising national analyses, while others required an analysis of the international data set.
The final international cognitive data set (that is, the data set of coded achievement booklet responses, available as intcogn.txt) consisted of 174 896 students from 32 participating countries. Table 40 shows the total number of sampled students, broken down by participating country and test booklet.
Test Targeting
Each of the domains was separately scaled to examine the targeting of the tests. Figures 22, 23 and 24 show the match between the international item difficulty distribution and the international distribution of student achievement for reading, mathematics and science, respectively.1
The figures consist of three panels. The first panel, Students, shows the distribution of students' Rasch-scaled achievement estimates. Students at the top end of this distribution have higher achievement estimates than students at the lower end. The second panel, Item Difficulties, shows the distribution of Rasch-estimated item difficulties, and the third panel, Step Difficulties, shows the item steps.
The number of steps for an item reported in the right-hand panel is the number of score categories minus 2, the highest step being reported for all items in the middle panel of the figure. That leaves one step to be reported in the right-hand panel for a partial credit item scored 0, 1, 2, and two steps for a partial credit item scored 0, 1, 2, 3. The item score for each of these items is shown in the right-hand panel in the item label, separated from the item number by a dot point.
In Figure 22, the student achievement distribution, shown by Xs, is located a little higher than the item difficulty distribution. This implies that, on average, the students in the PISA main study had an ability level above that needed to have a 50 per cent chance of solving the average item correctly.
Figure 23 shows a plot of item difficulties and step parameters, together with the student achievement distribution, for mathematics. The figure shows that the selected items for the main study, on average, have about a 50 per cent chance of being solved correctly by the average student in the tested population.
Chapter 13
1 The results reported here are based on a domain-by-domain unweighted scaling of the full international data set.
Table 40: Number of Sampled Students by Country and Booklet
Country   Booklet: SE 1 2 3 4 5 6 7 8 9   Total
Australia 566 567 587 593 584 578 581 560 560 5 176
Austria 24 539 525 499 525 511 528 527 532 535 4 745
Belgium 136 735 726 747 720 739 713 727 719 708 6 670
Brazil 545 547 545 539 556 553 537 532 539 4 893
Canada 3 246 3 312 3 326 3 243 3 291 3 307 3 336 3 311 3 315 29 687
Czech Republic 159 571 582 594 584 575 570 563 588 579 5 365
Denmark 474 452 482 478 469 459 464 473 484 4 235
Finland 5 546 548 545 531 527 546 536 545 535 4 864
France 501 506 521 517 531 525 528 520 524 4 673
Germany 47 549 574 573 567 551 556 556 556 544 5 073
Greece 516 509 532 522 517 522 514 526 514 4 672
Hungary 162 520 520 521 522 529 529 517 531 536 4 887
Iceland 379 376 380 377 380 363 374 376 367 3 372
Ireland 420 417 430 441 429 427 441 426 423 3 854
Italy 550 555 555 553 551 549 562 555 554 4 984
Japan 585 582 586 584 590 585 581 578 585 5 256
Korea 558 546 556 555 552 553 559 553 550 4 982
Latvia 427 430 430 435 436 436 443 428 428 3 893
Liechtenstein 34 37 36 34 33 33 35 35 37 314
Luxembourg 377 407 388 383 382 340 507 363 381 3 528
Mexico 524 519 515 516 513 498 500 511 504 4 600
Netherlands 23 280 281 267 280 282 282 278 264 266 2 503
New Zealand 426 408 413 405 395 402 404 416 398 3 667
Norway 459 471 465 458 465 461 450 461 457 4 147
Poland 390 429 407 425 408 418 406 410 361 3 654
Portugal 516 508 499 502 500 512 518 518 512 4 585
Russian Federation 742 744 752 748 745 747 743 740 740 6 701
Spain 685 685 682 710 699 700 691 679 683 6 214
Sweden 484 486 496 498 508 484 484 487 489 4 416
Switzerland 677 654 672 677 663 681 692 692 692 6 100
United Kingdom 1 024 1 049 1 042 1 021 1 051 1 031 1 044 1 040 1 038 9 340
United States 441 418 430 429 421 439 425 435 408 3 846
Total 556 19 286 19 370 19 473 19 372 19 383 19 327 19 523 19 360 19 246 174 896
(Item map: the left panel shows the distribution of student achievement estimates, marked by Xs, against item difficulty estimates in the middle panel and item-step difficulty estimates in the right panel; each X represents 1 125.1 cases.)
Figure 22: Comparison of Reading Item Difficulty and Achievement Distributions
(Item map for mathematics, in the same format as Figure 22; each X represents 1 080.9 cases.)
Figure 23: Comparison of Mathematics Item Difficulty and Achievement Distributions
Figure 24 shows a plot of the item difficulties and item step parameters, and the student achievement distribution, for science. The figure shows that the selected items for the main study, on average, have about a 50 per cent chance of being answered correctly by the average student in the tested population.
Test Reliability
A second important test characteristic is reliability. Table 41 shows the reliability for each of the three overall scales (combined reading literacy, mathematical literacy and scientific literacy) before conditioning, based upon three separate scalings. The reliability for each domain and each country, after conditioning, is reported later.2
Figure 24: Comparison of Science Item Difficulty and Achievement Distributions
(Item map for science, in the same format as Figure 22; each X represents 1 061.1 cases.)
2 The reliability index is an internal-consistency-like index estimated by the correlation between independent plausible value draws.
Table 41: Reliability of the Three Domains Based Upon Unconditioned Unidimensional Scaling
Scale Reliability
Mathematics 0.81
Reading 0.89
Science 0.78
Domain Inter-correlations
Correlations between the ability estimates for individual students in each of the three domains, the so-called latent correlations, as estimated by ConQuest (Wu, Adams and Wilson, 1997), are given in Table 42. It is important to note that these latent correlations are unbiased estimates of the true correlation between the underlying latent variables. As such, they are not attenuated by the unreliability of the measures and will generally be higher than typical product moment correlations that have not been disattenuated for unreliability.
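The relationship between observed and latent correlations can be illustrated with the classical (Spearman) correction for attenuation. This is only an illustration of why latent correlations run higher: ConQuest estimates them directly in the model rather than applying this formula. The observed correlation of 0.68 below is a made-up value; the reliabilities are the pre-conditioning values from Table 41:

```python
import math

def disattenuate(r_observed, rel_x, rel_y):
    """Classical correction for attenuation: estimate the correlation
    between two latent variables from the observed correlation and the
    reliabilities of the two measures."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical observed reading-mathematics correlation of 0.68,
# with the Table 41 reliabilities (reading 0.89, mathematics 0.81).
print(round(disattenuate(0.68, 0.89, 0.81), 3))  # → 0.801
```

With perfectly reliable measures the correction leaves the correlation unchanged; the lower the reliabilities, the larger the upward adjustment.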
Table 42: Latent Correlation Between the Three Domains
Scale Reading Science
Mathematics 0.819 0.846
Science 0.890
Reading Sub-scales
A five-dimensional scaling was performed on the achievement (cognitive) data, consisting of:
• Scale 1: mathematics items (M)
• Scale 2: reading items—retrieving information (R1)
• Scale 3: reading items—interpreting texts (R2)
• Scale 4: reading items—reflection and evaluation (R3)
• Scale 5: science items (S).
Table 43 shows the latent correlations between each pair of scales.
Table 43: Correlation Between Scales
Scale R1 R2 R3 S
M 0.854 0.836 0.766 0.846
R1 0.973 0.893 0.889
R2 0.929 0.890
R3 0.840
The correlations between the three reading sub-scales are quite high. The highest is between the retrieving information and interpreting texts sub-scales, and the lowest is between the retrieving information and reflection and evaluation sub-scales.
The correlations between the mathematics scale and the other scales are a little above 0.80, except the correlation with the reflection and evaluation sub-scale, which is estimated to be about 0.77.
The science scale correlates with the mathematics scale at 0.85. It correlates at about 0.89 with the first two reading sub-scales and at 0.84 with the reflection and evaluation sub-scale. The science scale thus correlates more highly with the first two reading sub-scales than with the mathematics scale.
Item Name Countries
R040Q02 Greece, Mexico
R040Q06 Italy
R055Q03 Austria, Germany, Switzerland (Ger.)
R076Q03 England, Switzerland (Ger.)
R076Q04 England
R076Q05 Belgium (Fl.), Netherlands
R091Q05 Russian Federation, Switzerland (Ger.)
R091Q07B Sweden
R099Q04B Poland
R100Q05 Belgium (Fl.), Netherlands
R101Q08 Canada (Fr.)
R102Q04A Korea
R111Q06B Switzerland (Ger.)
R119Q04 Hungary
R216Q02 Korea
R219Q01T Italy
R227Q01 Spain
R236Q01 Iceland
R237Q03 Korea
R239Q02 Switzerland (Ger.)
R246Q02 Korea
M033Q01 Brazil
M155Q01 Japan, Switzerland (Ger.)
M155Q03 Austria, Switzerland (Ger.)
M155Q04 Switzerland (Ger.)
S133Q04T Austria, Germany, Switzerland (Ger.)
S268Q02T Iceland, Netherlands
S268Q06 Switzerland (Ital.)
Figure 25: Items Deleted for Particular Countries
Scaling Outcomes

The procedures for the national and international scaling are outlined in Chapter 9 and need not be reiterated here.
National Item Deletions
The items were first scaled by country, and their fit was considered at the national level, as was the consistency of the item parameter estimates across countries. Consortium staff then adjudicated items, considering in detail how the items functioned both within and across countries. Those items considered to be dodgy (see Chapter 9) were then reviewed in
consultation with National Project Managers (NPMs). The consultations resulted in the deletion of a few items at the national level. The deleted items, listed in Figure 25, were recoded as not applicable and were included in neither the international scaling nor the generation of plausible values.
International Scaling
The international scaling was performed on the calibration data set of 13 500 students (500 randomly selected students from each of 27 countries). The item parameter estimates from this scaling are reported in Appendices 1, 2 and 3.
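Drawing a fixed-size random sample per country, as described for the calibration data set, can be sketched with pandas. The data frame, its column names and the seed are hypothetical; the report does not describe the sampling implementation:

```python
import pandas as pd

def draw_calibration_sample(students: pd.DataFrame, per_country: int = 500,
                            seed: int = 1) -> pd.DataFrame:
    """Simple random sample of a fixed number of students per country,
    mirroring the 500-per-country international calibration sample."""
    return students.groupby("country").sample(n=per_country, random_state=seed)

# Hypothetical student file with a 'country' column:
demo = pd.DataFrame({"country": ["AUS"] * 1000 + ["FIN"] * 1000,
                     "student_id": range(2000)})
calib = draw_calibration_sample(demo, per_country=500)
print(len(calib))  # 1000: 500 students from each of the two countries
```

Sampling an equal number of students from every country gives each country the same weight in the item parameter estimates, regardless of national sample size.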
Generating Student Scale Scores
Applying the conditioning approach described in Chapter 9 and anchoring all of the item parameters at the values obtained from the international scaling, weighted likelihood estimates (WLEs) and plausible values were generated for all sampled students.3 Table 44 gives the reliabilities at the international level for the generated scale scores.
Table 44: Final Reliability of the PISA Scales
Scale Reliability
Mathematics 0.90
Reading (Combined) 0.93
Reading-R1 0.91
Reading-R2 0.92
Reading-R3 0.90
Science 0.90
Differential Item Functioning
In Rasch modelling, an item is considered to exhibit differential item functioning (DIF) if the response probabilities for that item cannot be fully explained by the ability of the student and a fixed set of difficulty parameters for that item. In other words, the DIF analysis identifies items that are unusually easy or difficult for one group of students relative to another. The key word is unusually. In DIF analysis, the aim is not to find items with a higher or a lower p-value for one group than another, since this may result from genuine differences in the ability levels of the two groups. Rather, the objective is to identify items that appear to be too difficult or too easy after having controlled for differences in the ability levels of the two groups. The PISA main study used Rasch methods to explore gender DIF.

3 As described later in the chapter, a booklet effect was identified during national scaling. This meant that the planned scaling procedures had to be modified and a booklet correction made. Procedures for the booklet effect correction are described in a later section of this chapter.
When items are scaled with the Rasch model, the origin of the scale is arbitrarily set at zero and student proficiency estimates are then expressed on the scale. A student with a score of 0 has a probability equal to 0.50 of correctly answering an item with a Rasch difficulty estimate of 0. The same student has a probability higher than 0.50 for items with negative difficulty estimates and a probability lower than 0.50 for items with positive difficulty estimates.
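These probabilities follow from the dichotomous Rasch model, under which the chance of a correct response depends only on the difference between student proficiency (theta) and item difficulty (b). A minimal sketch:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Dichotomous Rasch model: probability that a student with
    proficiency theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_p(0.0, 0.0))             # 0.5: student matched to item difficulty
print(round(rasch_p(0.0, -1.0), 3))  # 0.731: easier item, higher probability
print(round(rasch_p(0.0, 1.0), 3))   # 0.269: harder item, lower probability
```

Because only the difference theta − b matters, shifting the whole scale's origin leaves every probability unchanged, which is why the zero point can be set arbitrarily.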
If two independent Rasch analyses are run on the same item set, using the responses from a sub-sample consisting of females only and a sub-sample of males only, the first analysis estimates the Rasch item difficulty that best fits the females' data (independently of the males' data), and the second analysis gives the analogous estimate for the males. Two independent difficulty estimates are then available for each item. If the males are better achievers than the females on average, relative item difficulty will not necessarily be affected.4
Figures 26, 27 and 28 show scatter-plots of the difficulty parameter estimates for male and female students for the three domains, respectively. Each item is represented by one point in the plots. The horizontal axis is the item difficulty estimate for females, and the vertical axis is the item difficulty estimate for males. All non-biased items lie on the diagonal; the further an item is from the diagonal, the bigger the gender DIF.
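The scatter-plot comparison can be reduced to a simple numerical screen: centre each group's difficulty estimates (removing the arbitrary origin of each separate scaling) and flag items whose centred estimates differ by more than some threshold. The threshold of 0.25 logits and the item difficulties below are illustrative assumptions, not PISA's flagging criteria:

```python
def flag_dif_items(b_girls, b_boys, threshold=0.25):
    """Flag items whose centred Rasch difficulty estimates differ between
    two groups by more than `threshold` logits (illustrative screen)."""
    mg = sum(b_girls) / len(b_girls)   # mean difficulty, girls' scaling
    mb = sum(b_boys) / len(b_boys)     # mean difficulty, boys' scaling
    flags = []
    for i, (bg, bb) in enumerate(zip(b_girls, b_boys)):
        diff = (bg - mg) - (bb - mb)   # centred difference in logits
        if abs(diff) > threshold:
            flags.append((i, round(diff, 2)))
    return flags

# Hypothetical difficulty estimates for five items:
girls = [-1.0, 0.0, 0.5, 1.0, 1.5]
boys  = [-1.0, 0.0, 0.9, 1.0, 1.1]
print(flag_dif_items(girls, boys))  # → [(2, -0.4), (4, 0.4)]
```

Centring first is what distinguishes DIF from a plain difficulty comparison: an overall ability gap between the groups shifts all estimates together and cancels out, so only items that depart unusually from the common pattern are flagged.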
Most of the items, regardless of the domain, show a statistically significant DIF, which is not surprising, since the standard error of the item parameters is inversely proportional to the sample size. Most of the differences between the item parameter estimates for males and females are quite small. The graphical displays also show that there are only a small number of outliers.
4 The DIF analysis was performed in one run of the multi-facet model of ConQuest, with an Item by Gender interaction term added to the standard model.
Figure 26: Comparison of Item Parameter Estimates for Males and Females in Reading

Figure 27: Comparison of Item Parameter Estimates for Males and Females in Mathematics

Figure 28: Comparison of Item Parameter Estimates for Males and Females in Science
Test Length
After the field trial, NPMs expressed some concern regarding test length. The field trial data were analysed using three sources of information: not-reached items; field trial timing variables; and the last question in the field trial Student Questionnaire, which asked students if they had had sufficient time to do the test and, if not, how many items they had had time to do. 'Session' in the discussion below refers to one of two equal parts of the testing time (i.e., each session was intended to take one hour).
The results of these analyses were the following:
• Based on the pattern of missing and not-reached items for both sessions, there appeared to be no empirical evidence to suggest that two 60-minute testing sessions were too long. If they were, one would have expected that the tired students would have given up earlier in the second session than in the first, but this was not the case.
• Fatigue may also have an effect on student achievement. Unfortunately, the field trial item allocation did not permit item anchoring between sessions 1 and 2 (except for a few mathematics items), so it was not possible to disentangle item difficulty effects from fatigue effects.
• The results of the various analyses suggested that the field trial instruments were a little too long. The main determinant of test length (for reading) was the number of words that needed to be read. Analysis suggested that about 3 500 words per session should be set as an upper limit for the English version. In the field trial, test sessions with more words resulted in a substantially larger number of not-reached items.
• The mathematics and science sessions were also a bit too long, but this may have been due to the higher than desired difficulty of these items in the field trial.

The main study instruments were built with the above information in mind, especially regarding the number of words in the test.
For the main study, missing responses for the last items in each session were also recoded as not reached. But if the student was not present for the entire session, the not-reached items were recoded as not applicable. This means that these missing data are not included in the results presented in this section, and that these students' ability estimates were not affected by not responding to these items when proficiency scores were computed.
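The recoding rules described above can be sketched as follows. This is a simplified illustration of the trailing-gap logic for one session; the codes None, "NR" and "NA" are hypothetical stand-ins for PISA's actual missing, not-reached and not-applicable codes:

```python
def recode_session(responses, student_present: bool):
    """Recode trailing missing responses in one testing session.

    Trailing missing responses become 'NR' (not reached); if the student
    was not present for the session they become 'NA' (not applicable)
    and are excluded from ability estimation. Codes are illustrative.
    """
    code = "NR" if student_present else "NA"
    out = list(responses)
    i = len(out) - 1
    while i >= 0 and out[i] is None:   # walk back over the trailing gap
        out[i] = code
        i -= 1
    return out

print(recode_session([1, 0, 1, None, None], student_present=True))
# → [1, 0, 1, 'NR', 'NR']
print(recode_session([None, None, None], student_present=False))
# → ['NA', 'NA', 'NA']
```

Only the trailing run of missing responses is touched: embedded missing responses earlier in the session are left as ordinary missing data.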
Table 45 shows the number of missing responses and the number of missing responses recoded as not reached, by booklet, according to these rules. Table 46 shows the same information by country.
Table 45: Average Number of Not-Reached Items and Missing Items by Booklet and Testing Session

               Session 1                 Session 2
Booklet   Missing   Not Reached     Missing   Not Reached
0 5.68 1.59 NA NA
1 1.40 0.60 4.70 1.10
2 0.89 0.23 2.44 0.87
3 2.20 0.70 1.97 0.49
4 2.66 1.60 3.11 0.96
5 1.64 1.51 3.95 0.69
6 1.26 0.77 1.77 0.29
7 1.82 0.91 1.75 0.31
8 3.36 1.15 2.91 1.08
9 3.63 0.86 2.49 1.00
Total 2.10 0.93 2.79 0.76
Most of the means of not-reached items are below one in the main study, while more than 50 per cent of these means were higher than one for the field trial. As the averages of not-reached items per session and per booklet are quite similar, it appears that test length was quite similar across booklets for each session.
As in the field trial, the average number of not-reached items differs from one country to another. It is worth noting that countries with higher averages of not-reached items also have higher averages of missing data.
Tables 47 and 48 provide the percentage distribution of not-reached items per booklet and per session. The percentage of students who reached the last item (i.e., the percentage of students with zero not-reached items) ranges from 70 to 95 per cent for the first session and from 76 to 95 per cent for the second session.
Table 46: Average Number of Not-Reached Items and Missing Items by Country and Testing Session

               Session 1                 Session 2
Country   Missing   Not Reached     Missing   Not Reached
Australia 1.61 0.64 2.23 0.53
Austria 2.09 0.53 2.79 0.41
Belgium 1.85 0.88 2.53 0.95
Brazil 4.16 5.26 5.31 3.98
Canada 1.31 0.54 1.80 0.39
Czech Republic 2.50 1.18 3.19 0.65
Denmark 2.65 1.32 3.57 1.09
Finland 1.39 0.46 1.96 0.41
France 2.44 1.09 2.99 0.91
Germany 2.69 0.98 3.41 0.66
Greece 2.95 1.23 4.75 1.40
Hungary 2.79 1.67 3.59 1.23
Iceland 1.86 0.77 2.58 0.74
Ireland 1.31 0.55 1.86 0.48
Italy 3.12 1.43 4.14 1.21
Japan 2.32 0.80 3.22 0.70
Korea 1.28 0.22 1.88 0.29
Latvia 3.36 1.56 4.42 1.98
Liechtenstein 2.67 0.86 3.50 0.78
Luxembourg 3.01 1.57 4.61 1.07
Mexico 2.37 1.38 3.10 1.13
Netherlands 0.46 0.18 0.70 0.12
New Zealand 1.36 0.49 1.98 0.45
Norway 2.26 0.69 3.23 0.88
Poland 3.16 0.87 4.54 1.00
Portugal 2.60 1.13 3.47 0.66
Russian Federation 3.01 2.43 3.71 1.88
Spain 2.08 1.31 2.93 1.17
Sweden 2.08 0.75 2.81 0.73
Switzerland 2.34 0.82 3.14 0.59
United Kingdom 1.51 0.35 2.25 0.41
United States 1.27 0.77 1.95 0.64
Table 47: Distribution of Not-Reached Items by Booklet, First Testing Session
No. of Not-                              Booklet
Reached Items    0     1     2     3     4     5     6     7     8     9
0              70.1  86.7  95.1  87.4  70.9  79.5  85.9  83.7  81.9  84.1
1              13.4   2.7   0.3   0.7   3.0   1.2   1.0   2.7   3.9   3.4
2               0.7   2.3   0.9   1.6   4.8   1.6   0.8   3.9   1.3   1.0
3               1.3   2.2   0.3   3.9   3.3   4.4   1.7   2.5   0.9   1.8
4               1.3   1.1   0.7   0.2   2.7   1.0   4.7   0.4   1.8   1.4
5               1.7   2.2   0.4   0.9   3.8   0.5   1.6   0.5   0.7   2.3
6               0.7   0.2   1.8   0.4   1.5   1.0   0.6   0.4   0.8   1.5
> 6 10.8 2.6 0.5 4.9 10.0 10.8 3.7 5.9 8.7 4.5
Magnitude of Booklet Effects
After scaling the PISA 2000 data for each country separately, achievement scores for mathematics, reading and science could be compared across countries and across booklets. (In these analyses, some sub-regions within countries were considered as countries.)
Tables 49, 50 and 51 present student scale scores for the three domains, standardised to have a mean of 10 and a standard deviation of 2 for each domain and country combination. The table rows represent countries (or sub-regions within countries) and the columns represent booklets. The purpose of these analyses and tables is to examine the nature of any booklet effects, not to rank countries; countries are therefore not named.
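The standardisation used for these tables can be sketched as follows. This is a minimal illustration, not the operational code; the scores are invented, and the function simply rescales one country-by-domain set of scores to a mean of 10 and a standard deviation of 2:

```python
# Hedged sketch: rescale one country-domain set of scores to mean 10, SD 2,
# as used for Tables 49 to 51. The raw scores below are invented.
def standardise(scores, target_mean=10.0, target_sd=2.0):
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    return [target_mean + target_sd * (x - mean) / sd for x in scores]

raw = [420.0, 500.0, 540.0, 580.0, 460.0]   # hypothetical scale scores
std = standardise(raw)
mean = sum(std) / len(std)
sd = (sum((x - mean) ** 2 for x in std) / len(std)) ** 0.5
print(round(mean, 6), round(sd, 6))   # 10.0 2.0 by construction
```

Booklet means computed from such standardised scores can then be compared directly across countries, as in the tables below.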
The variation in the means between booklets is greater than was expected. As the scaling was supposed to equate the booklets, and because the booklets were systematically rotated within schools, it was expected that the only between-booklet variance would be sampling variance.
The variations observed between booklets are quite stable, however, across countries, leaving a picture of systematically easier and harder booklets. The booklet means over countries, shown in the last row of each table, indicate mean booklet differences up to 0.52 (one-quarter of a student standard deviation).5
Scaling outcomes
5 Since the Special Education booklet (Booklet 0) was not randomly assigned to students, the means were expected to be lower than 10.
Table 48: Distribution of Not-Reached Items by Booklet, Second Testing Session
No. of Not-                          Booklet
Reached Items    1     2     3     4     5     6     7     8     9
0              76.2  78.7  89.6  85.4  87.3  95.1  91.8  86.9  79.6
1               3.9  12.3   3.1   1.6   3.0   1.4   2.2   1.0   4.2
2               2.9   0.9   2.0   1.9   0.9   0.4   1.1   1.1   2.4
3               6.9   0.6   1.3   2.3   1.5   0.2   0.6   1.0   1.4
4               1.5   1.4   0.5   1.2   1.0   0.5   1.1   1.5   2.8
5               0.7   0.5   0.3   0.7   1.9   0.2   1.0   0.6   3.1
6               3.7   0.4   0.3   0.5   1.3   0.3   0.5   1.4   1.2
> 6 4.2 5.2 2.9 6.4 3.1 1.9 1.7 6.5 5.3
Note: Booklet 0 required only one hour to complete and is therefore not shown in this table.
Table 49: Mathematics Means for Each Booklet and Country

                    Booklet
   0      1      3      5      8      9     Mean
   .     9.70  10.19   9.90  10.43   9.78  10.00
   .     9.80  10.26   9.60  10.51   9.78  10.00
  6.09   9.54  10.36   9.68  10.64   9.80  10.00
   .     9.48  10.33   9.88  10.67   9.64  10.00
   .     9.62  10.10   9.96  10.57   9.75  10.00
   .     9.70  10.20   9.75  10.48   9.86  10.00
   .     9.45  10.21   9.80  10.60   9.95  10.00
   .     9.60  10.28   9.76  10.59   9.78  10.00
   .     9.61  10.24   9.83  10.55   9.77  10.00
  6.60   9.83  10.26   9.94  10.47   9.77  10.00
  7.74   9.75  10.27   9.99  10.24   9.95  10.00
   .     9.73  10.11   9.79  10.56   9.81  10.00
   .     9.78   9.98   9.83  10.64   9.75  10.00
   .     9.47  10.07  10.25  10.55   9.64  10.00
   .     9.85  10.16   9.83  10.39   9.77  10.00
   .     9.81   9.98   9.63  10.63   9.93  10.00
   .     9.82  10.12   9.82  10.30   9.94  10.00
  6.79   9.79  10.38   9.88  10.65  10.01  10.00
   .     9.58  10.29   9.87  10.41   9.85  10.00
   .     9.74  10.26   9.79  10.30   9.91  10.00
   .     9.66  10.13   9.79  10.45   9.98  10.00
   .     9.59  10.39   9.74  10.58   9.70  10.00
  9.48  10.25   9.85   9.93  11.03   9.80  10.00
  7.51   9.90  10.48   9.82  10.61   9.85  10.00
   .     9.79  10.35   9.75  10.52   9.60  10.00
   .     9.81   9.98   9.85  10.72   9.68  10.00
   .     9.72  10.18   9.85  10.30   9.94  10.00
   .     9.69  10.32   9.85  10.44   9.62  10.00
   .     9.52  10.41   9.93  10.54   9.61  10.00
  8.59   9.72  10.39  10.01  10.67   9.63  10.00
   .     9.60  10.27   9.75  10.49   9.89  10.00
  6.36   9.78  10.20   9.87  10.38   9.95  10.00
  7.47   9.72  10.20  10.01  10.59   9.80  10.00
  8.73   9.68  10.23   9.85  10.52   9.81  10.00
  7.54   9.70  10.20   9.86  10.52   9.78

Note: Scaled to a mean of 10 and standard deviation of 2.
Each row represents a country or a subnational community that was treated independently in the study.
These differences would affect the ability estimates of the students who worked on easier or harder booklets, and therefore booklet effects needed to be explained and corrected for in an appropriate way in subsequent scaling.
Exploratory analyses indicated that an order effect that was adversely affecting the between-booklet equating was the most likely explanation for this booklet effect. This finding is illustrated in Figure 29, which compares item parameter estimates for cluster 1 obtained from three separate scalings, one for each booklet to which cluster 1 was allocated. For the purposes of this comparison, the item parameters for reading were re-estimated per booklet on the calibration sample of 13 500 students (about 1 500 students per booklet). To be able to compare the item parameter estimates, they were centred on cases, assuming no differences in ability across booklets at the population level. Figure 29 therefore shows the deviations of each item from the mean parameter for that item over all booklets in which it appeared. Each item of clusters 1 to 8 was administered in three different booklets, and the items in cluster 9 were administered in two different booklets. For each item, the differences between the three (or two) estimates and their mean are displayed and grouped into one figure per cluster. The remaining figures are included in Appendix 6.
Except for the first four items, the items in cluster 1 were easier in Booklet 1 than in Booklets 5 and 7. The standard errors, not printed in the figures, ranged from about 0.04 to 0.08 for most items.
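The quantity plotted in Figure 29 can be computed directly: for each item, the deviation of each booklet's centred difficulty estimate from that item's mean estimate over the booklets in which it appears. A minimal sketch, with invented estimates rather than the actual PISA values:

```python
# Hedged sketch: deviations of per-booklet item difficulty estimates from the
# item's mean over booklets (the quantity shown in Figure 29). Data invented.
def deviations(estimates_by_booklet):
    # estimates_by_booklet: {booklet: centred difficulty estimate} for one item
    mean = sum(estimates_by_booklet.values()) / len(estimates_by_booklet)
    return {b: est - mean for b, est in estimates_by_booklet.items()}

item1 = {"book 1": -0.30, "book 5": 0.05, "book 7": 0.10}  # hypothetical
print({b: round(d, 3) for b, d in deviations(item1).items()})
# A negative deviation means the item was easier in that booklet.
```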
The legend indicates the positions of the cluster in the booklets: cluster 1 is the first cluster in the first half of Booklet 1 (Pos. 1.1), the
Table 50: Reading Means for Each Booklet and Country
                              Booklet
  0      1      2      3      4      5      6      7      8      9     Mean
. 10.07 10.43 10.15 9.63 10.04 10.32 9.99 9.64 9.73 10.00
. 10.48 10.18 10.09 9.64 10.02 9.95 9.84 9.68 10.13 10.00
4.00 10.05 10.24 10.13 9.72 10.08 10.11 9.96 9.88 9.87 10.00
. 9.89 10.22 10.22 9.73 10.04 10.14 10.22 9.80 9.74 10.00
. 10.13 10.41 10.12 9.75 10.01 10.17 9.89 9.73 9.78 10.00
. 9.95 10.22 10.01 9.88 9.89 10.15 10.06 9.87 9.97 10.00
. 10.06 10.44 10.11 9.66 9.98 10.27 9.91 9.53 10.03 10.00
. 9.85 10.26 10.21 9.96 10.13 10.11 10.07 9.70 9.71 10.00
. 9.98 10.38 10.14 9.81 10.09 10.05 9.98 9.71 9.88 10.00
4.85 10.18 10.38 10.08 9.82 10.03 10.01 10.14 9.83 9.96 10.00
7.10 10.07 10.07 10.17 9.84 10.11 10.23 9.96 9.81 9.96 10.00
. 10.15 10.31 10.21 9.79 9.99 10.09 10.13 9.52 9.81 10.00
. 10.26 10.36 10.26 9.86 9.90 10.03 9.77 9.75 9.82 10.00
. 9.98 10.44 9.94 9.65 9.98 10.13 10.10 9.90 9.89 10.00
. 10.23 10.21 9.99 9.78 9.89 10.20 9.75 9.92 10.02 10.00
. 10.07 10.39 10.13 9.83 9.86 10.20 9.94 9.84 9.70 10.00
. 10.13 10.23 10.03 9.82 9.92 10.07 10.02 9.80 9.97 10.00
5.94 9.98 10.16 10.00 9.78 10.00 10.25 9.97 10.35 10.42 10.00
. 10.10 10.31 10.10 9.72 9.99 10.07 9.93 9.77 10.00 10.00
. 10.01 10.34 10.14 9.88 10.06 10.00 9.96 9.66 9.93 10.00
. 10.25 10.47 9.87 9.64 9.80 10.18 9.94 9.95 9.91 10.00
. 10.09 10.46 10.08 9.61 9.99 10.08 9.98 9.71 10.00 10.00
8.09 10.35 10.70 10.38 10.07 10.13 10.50 10.39 10.20 10.33 10.00
5.86 10.20 10.45 10.21 9.78 10.10 10.25 10.04 10.00 10.11 10.00
. 10.24 10.58 10.07 9.63 9.94 9.99 9.92 9.51 10.14 10.00
. 10.00 10.50 10.00 9.80 9.99 10.20 9.95 9.84 9.69 10.00
. 10.21 10.51 10.00 9.67 10.01 10.01 10.09 9.71 9.81 10.00
. 10.23 10.42 10.14 9.61 9.98 10.21 10.03 9.56 9.95 10.00
. 9.96 10.29 10.22 9.79 10.00 10.17 10.04 9.68 9.87 10.00
7.95 10.07 10.39 10.24 9.78 10.04 10.28 10.08 9.68 10.09 10.00
. 9.99 10.29 10.12 9.82 9.93 10.15 9.98 9.89 9.82 10.00
5.97 10.12 10.33 10.08 9.92 10.15 10.19 9.85 9.70 9.84 10.00
7.18 10.06 10.42 10.16 9.87 10.05 10.27 10.04 9.82 9.66 10.00
7.48 10.10 10.36 10.11 9.77 10.00 10.14 9.99 9.78 9.94 10.00
6.44 10.11 10.34 10.11 9.78 10.01 10.15 9.99 9.78 9.90
Note: Scaled to a mean of 10 and standard deviation of 2.
described by a booklet effect, since they vary not only with respect to the booklets but also differentially across items.
Nevertheless, there is evidence of an order effect of some kind. The parameters from the first position in the first half are lower than the parameters for other positions for almost every item, although the differences are occasionally negligible. The relation of the other positions (for clusters 1 to 8) is less clear. For clusters 2, 4, 5 and 6, the difficulties of positions 1.1 and 2.1 look similar, whereas position 1.2 gives higher item parameter estimates. For clusters 1 and 7, positions 1.2 and 2.1 show a similar difficulty and are both higher than in position 1.1. Strangely, for cluster 3, position 1.2 is similar to position 1.1, and position 2.1 looks much more difficult. The differences in clusters 8 and 9 may well be explained by an order effect. Interestingly, after some mathematics and science items, position 2.2 looks much harder than after reading items only (Booklet 8).
Correction of the Booklet Effect
Modelling the effect
Modelling the order effects in terms of item positions in a booklet, or at least in terms of cluster positions in a booklet, would result in a very complex model. To keep the international scaling tractable, the effect was instead modelled at the booklet level.
Booklet effects were included in the measurement model to prevent confounding item difficulties and booklet effects. For the ConQuest model statement, the calibration model was:

item + item*step + booklet.

The booklet parameter, formally defined in the same way as item parameters, reflects booklet difficulty.
This measurement model provides item parameter estimates that are not affected by the booklet difficulties, and booklet difficulty parameters that can be used to correct student ability scores.
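For a dichotomous item, the booklet term enters such a model as an additional difficulty offset. The sketch below is an illustration of this model form only (it is not ConQuest, and the parameter values are invented): the probability of a correct response depends on ability, item difficulty, and a booklet difficulty parameter.

```python
import math

# Hedged sketch of the calibration model's form for a dichotomous item:
# success probability from ability theta, item difficulty delta_i, and a
# booklet difficulty b_k. All parameter values here are invented.
def p_correct(theta, delta_i, b_k):
    return 1.0 / (1.0 + math.exp(-(theta - delta_i - b_k)))

# A harder booklet (b_k > 0) lowers the success probability at fixed ability:
print(round(p_correct(0.0, 0.0, 0.0), 3))    # 0.5
print(round(p_correct(0.0, 0.0, 0.23), 3))   # 0.443, i.e. below 0.5
```

Because b_k is shared by all items within a booklet, estimating it jointly with the item parameters separates booklet difficulty from item difficulty, which is exactly what the calibration model above is designed to do.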
Table 51: Science Means for Each Booklet and Country

                    Booklet
   0      2      4      6      8      9     Mean
   .     9.65   9.89  10.29   9.67  10.52  10.00
   .     9.66   9.67  10.44   9.39  10.79  10.00
  7.72   9.65   9.92  10.43   9.68  10.34  10.00
   .     9.68   9.72  10.40   9.58  10.62  10.00
   .     9.50   9.85  10.34   9.82  10.50  10.00
   .     9.60   9.89  10.32   9.82  10.38  10.00
   .     9.50   9.77  10.29   9.93  10.50  10.00
   .     9.79   9.86  10.21   9.68  10.47  10.00
   .     9.76   9.78  10.28   9.76  10.40  10.00
  7.07   9.89   9.95  10.23   9.88  10.31  10.00
  8.08   9.64   9.84  10.34  10.07  10.28  10.00
   .     9.84   9.76  10.13   9.80  10.46  10.00
   .     9.53   9.59  10.08  10.03  10.78  10.00
   .     9.44   9.94  10.47   9.68  10.45  10.00
   .     9.77   9.83  10.25   9.74  10.41  10.00
   .     9.65   9.79  10.08  10.09  10.47  10.00
   .     9.85   9.77  10.14   9.94  10.30  10.00
  6.87   9.98   9.91  10.43  10.00  10.39  10.00
   .     9.69   9.83  10.12   9.80  10.56  10.00
   .     9.68  10.00  10.12   9.66  10.54  10.00
   .     9.72   9.67  10.22  10.16  10.22  10.00
   .     9.93   9.64  10.28   9.77  10.39  10.00
  9.55   9.73  10.03  10.35   9.60  11.01  10.00
  7.39   9.87   9.95  10.47   9.80  10.64  10.00
   .     9.84   9.68  10.22   9.63  10.64  10.00
   .     9.57   9.80  10.21  10.03  10.45  10.00
   .     9.91   9.82  10.18   9.84  10.25  10.00
   .     9.55   9.81  10.41   9.53  10.70  10.00
   .     9.64  10.06  10.41   9.45  10.45  10.00
  8.54   9.81   9.74  10.42   9.75  10.71  10.00
   .     9.57   9.95  10.16   9.92  10.40  10.00
  6.54   9.86   9.71  10.36   9.92  10.30  10.00
  6.70   9.77   9.91  10.15  10.07  10.54  10.00
  8.77   9.72   9.83  10.25   9.81  10.49  10.00
  7.72   9.70   9.82  10.26   9.80  10.47
Note: Scaled to a mean of 10 and standard deviation of 2.
second cluster in the first half of Booklet 5 (Pos. 1.2), and the first cluster in the second half of Booklet 7 (Pos. 2.1). Items are listed in the order in which they were administered.
The differences in item difficulties over all nine clusters (see Appendix 6) cannot simply be
Estimating the parameters

The calibration model given above was used to estimate the international item parameters. It was estimated using the international calibration sample of 13 500 students, and not-reached items in the estimation were treated as not administered.
The booklet parameters obtained from this analysis were not used to correct for the booklet effect. Instead, a set of booklet parameters was obtained by scaling the whole international data set using booklet as a conditioning variable and applying a specific weight which gives each OECD country6 an equal contribution7. Further, in this analysis, not-reached items were regarded as incorrect responses, in accordance with the computation of student ability scores reported in Chapter 9. The students who responded to the SE booklet were excluded from the estimation. The booklet parameter estimates obtained are reported in Table 52. Note that the table reports the deviation per booklet from the average within each of the three domains over all booklets, and that therefore these parameters add up to zero per domain.
In order to display the magnitude of the effects, the booklet difficulty parameters are presented again in Table 53, after their transformation into the PISA reporting scale, which has a mean of 500 and a standard deviation of 100.
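Comparing Tables 52 and 53 suggests that the transformation multiplies each logit deviation by 100 divided by the domain's standard deviation in logits. The sketch below illustrates this under an assumption: a reading logit standard deviation of roughly 1.1 is inferred from the two tables, not stated in the text.

```python
# Hedged sketch: convert a booklet difficulty deviation from logits to PISA
# scale points (mean 500, SD 100). The reading logit SD of about 1.1 is an
# assumption inferred by comparing Tables 52 and 53.
def to_pisa_points(logit_deviation, domain_sd_logits):
    return logit_deviation * 100.0 / domain_sd_logits

print(round(to_pisa_points(-0.23, 1.1), 1))   # reading Booklet 2: about -20.9
print(round(to_pisa_points(0.13, 1.1), 1))    # reading Booklet 9: about 11.8
```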
Table 52: Booklet Difficulty Parameters
Booklet Mathematics Reading Science
1 0.23 -0.05 -
2 - -0.23 0.18
3 -0.14 -0.07 -
4 - 0.12 0.10
5 0.13 -0.01 -
6 - -0.11 -0.13
7 - 0.01 -
8 -0.35 0.20 0.10
9 0.13 0.13 -0.24
Table 53: Booklet Difficulty Parameters Reported on the PISA Scale
Booklet Mathematics Reading Science
1 17.6 -4.5 -
2 - -20.9 16.2
3 -10.7 -6.4 -
4 - 10.9 9.0
5 9.9 -0.9 -
6 - -10.0 -11.7
7 - 0.9 -
8 -26.7 12.8 9.0
9 9.9 11.8 -21.6
Figure 29: Item Parameter Differences for the Items in Cluster One
[Figure: item parameter differences (vertical axis, -0.4 to 0.4) for items 1 to 17 of cluster 1, with separate series for Pos. 1.1 (Booklet 1), Pos. 1.2 (Booklet 5) and Pos. 2.1 (Booklet 7).]
6 Due to the decision to omit the Netherlands from reports that focus on variation in achievement scores, the scores were not represented in this sample.
7 These weights are computed to achieve an equal contribution from each country. They can be based on any weights used in the countries (e.g., to adjust for the sampling design) or unweighted data.
Applying the correction

To correct the student scores for the booklet effects, two alternatives were considered:
• to correct all students' scores using one set of the internationally estimated booklet parameters; and
• to correct the students' scores using one set of nationally estimated booklet parameters for each country. These coefficients can be obtained as booklet regression coefficients from estimating the conditioning models, one country at a time.

All available sets of booklet parameters sum to zero. Applying these parameters to every student in a country (having an equal distribution of Booklets 1 to 9 in each country) does not change the mean country performance. Therefore the application of any correction does not change the country ranking.
With the international correction, all countries are treated in exactly the same way, and it is therefore the most desirable option from a theoretical point of view. This is the option that was implemented.
From a statistical point of view, it leaves some slight deviations from equal booklet difficulties in the data. These deviations vary over countries. Using the national correction, these deviations would become negligible.
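The sum-to-zero property can be checked with a small computation. In this sketch the student scores are invented, while the booklet parameters are the mathematics column of Table 52; with an equal number of students per booklet, adding (or subtracting, depending on sign convention) the corrections leaves the country mean unchanged:

```python
# Hedged sketch: with an equal distribution of booklets and sum-to-zero
# booklet parameters, a booklet correction cannot move the country mean.
booklet_params = {1: 0.23, 3: -0.14, 5: 0.13, 8: -0.35, 9: 0.13}  # Table 52, mathematics
assert abs(sum(booklet_params.values())) < 1e-9   # sums to zero per domain

# One hypothetical student per booklet (equal booklet distribution):
scores = {1: 9.7, 3: 10.2, 5: 9.9, 8: 10.5, 9: 9.8}
uncorrected_mean = sum(scores.values()) / len(scores)
corrected_mean = sum(s + booklet_params[b] for b, s in scores.items()) / len(scores)
print(round(uncorrected_mean, 6) == round(corrected_mean, 6))   # True
```

The same cancellation holds whichever sign convention is used for the correction, which is why neither option changes the country ranking.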
Chapter 14

Outcomes of Marker Reliability Studies
This chapter reports the results of the various marker reliability studies that were implemented. The methodologies for these studies are described in Chapter 10.
Within-Country Reliability Studies
Norman Verhelst
Variance Components Analysis
Tables 54 to 58 show the results of the variance components analysis for the multiply-marked items in mathematics, science and the three reading sub-scales (usually referred to as scales for convenience), respectively. The variance components are each expressed as a percentage of their sum.
The tables show that those variance components associated with markers (those involving γ) are remarkably small relative to the other components. This means that there are no significant systematic within-country marker effects.
As discussed in Chapter 10, analyses of the type reported here can result in negative variance estimates. If the amount by which the component is negative is small, then this is a sign that the variance component is negligible (near zero). If the component is large and negative, then it is a sign that the analysis method is inappropriate for the data. In Tables 54 to 58, countries with large inadmissible ρ3-estimates are listed at the bottom of the tables and are marked with an asterisk. (Some sub-regions within countries were considered as countries for these analyses.) For Poland, some of the estimates were so highly negative that the resulting numbers did not fit in the format of the tables. Poland has therefore been omitted altogether.
Generalisability Coefficients
The generalisability coefficients are computed from the variance components using:
ρ3(Yv.., Y'v..) = (σ²α + σ²αβ/I) / (σ²α + σ²αβ/I + (σ²γ + σ²αγ)/M + (σ²βγ + σ²ε)/(I × M))   (88)

They provide an index of reliability for the multiple marking in each country. I denotes the number of items and M the number of markers. By using different values for I and M, one obtains a generalisation of the Spearman-Brown formula for test-lengthening. In Tables 59 to 63, the formula is evaluated for the six combinations of I = 8, 16, 24 and M = 1, 2, using the variance component estimates from the corresponding tables presented above. For the countries marked with '*' in the above tables, no values are displayed, because they fall outside the acceptable (0, 1) range.
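As a numerical check, a minimal sketch that evaluates equation (88) on one row of variance components (the Australia row of Table 54) reproduces the corresponding row of Table 59:

```python
# Evaluate the generalisability coefficient rho_3 of equation (88) from
# variance component estimates, for I items and M markers.
def rho3(v, I, M):
    common = v["alpha"] + v["alpha_beta"] / I
    marker_error = (v["gamma"] + v["alpha_gamma"]) / M + (v["beta_gamma"] + v["eps"]) / (I * M)
    return common / (common + marker_error)

# Australia, mathematics (Table 54): sigma^2 components as percentages.
australia = {"alpha": 24.99, "alpha_beta": 40.54, "gamma": 0.00,
             "alpha_gamma": 0.12, "beta_gamma": 0.04, "eps": 6.07}

for I in (8, 16, 24):
    for M in (1, 2):
        print(I, M, round(rho3(australia, I, M), 3))
# Reproduces the Australia row of Table 59:
# 0.971, 0.986, 0.982, 0.991, 0.986, 0.993
```

Because only ratios of the components enter the formula, expressing the components as percentages of their sum (as in Tables 54 to 58) gives the same coefficients as the raw variances would.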
Table 54: Variance Components for Mathematics
Country  Student (σ²α)  Item (σ²β)  Marker (σ²γ)  Student-Item Interaction (σ²αβ)  Student-Marker Interaction (σ²αγ)  Item-Marker Interaction (σ²βγ)  Measurement Error (σ²ε)
Australia 24.99 28.24 0.00 40.54 0.12 0.04 6.07
Austria 15.17 30.81 0.22 45.14 0.17 1.80 6.68
Belgium (Fl.) 23.29 24.86 0.02 46.59 0.06 0.04 5.15
Czech Republic 22.67 27.22 0.01 45.10 -0.02 0.07 4.95
Denmark 19.78 32.69 0.00 42.34 -0.01 0.07 5.14
England 27.49 31.12 0.00 38.10 0.07 0.03 3.20
Finland 15.93 29.44 0.01 51.89 0.00 0.02 2.70
France 18.11 31.27 0.01 43.66 0.08 0.13 6.73
Germany 20.92 26.36 0.00 46.39 -0.01 0.31 6.03
Greece 20.40 23.05 0.02 52.61 0.02 0.04 3.86
Hungary 24.52 23.97 0.01 47.36 -0.30 0.04 4.40
Iceland 22.35 29.47 0.03 43.97 0.10 0.03 4.05
Ireland 22.46 26.31 0.11 41.85 0.18 0.20 8.90
Italy 17.75 24.60 0.01 51.74 0.07 0.14 5.69
Japan 21.35 24.83 0.00 51.90 -0.06 0.02 1.97
Korea 20.32 29.04 -0.01 45.43 0.00 0.14 5.08
Mexico 14.06 19.58 -0.03 54.60 0.70 0.13 10.96
Netherlands 23.66 26.45 0.09 42.09 0.05 0.16 7.49
New Zealand 20.99 29.39 0.03 43.43 0.10 0.16 5.90
Norway 18.66 32.03 0.00 44.55 0.13 0.07 4.56
Portugal 18.56 31.90 -0.01 47.29 0.08 0.05 2.13
Russian Federation 24.39 24.19 0.03 47.17 0.02 0.01 4.18
Spain 19.84 28.64 0.10 44.63 0.19 0.09 6.50
Sweden 21.24 29.35 0.01 44.44 0.08 0.09 4.80
Switzerland (Fr.) 20.80 25.83 0.09 43.71 0.24 0.34 8.99
Switzerland (Ger.) 21.60 28.30 0.00 47.29 0.06 0.01 2.75
Switzerland (Ital.) 19.25 29.62 0.12 41.26 0.82 0.12 8.81
Belgium (Fr.)* 32.85 24.41 0.50 38.04 -29.66 -0.24 34.09
Brazil* 32.59 6.06 0.78 46.99 -32.39 -1.14 47.12
Latvia* 20.29 28.87 0.02 42.83 -1.81 0.26 9.54
Luxembourg* 17.45 23.87 0.03 47.06 -1.79 0.32 13.07
Scotland* 19.66 29.49 0.23 38.00 -3.23 2.10 13.76
United States* 37.21 25.25 1.14 31.21 -42.32 -1.53 49.04
* ρ3-coefficients were inadmissible (>1) in these countries.
Table 55: Variance Components for Science
Country  Student (σ²α)  Item (σ²β)  Marker (σ²γ)  Student-Item Interaction (σ²αβ)  Student-Marker Interaction (σ²αγ)  Item-Marker Interaction (σ²βγ)  Measurement Error (σ²ε)
Australia 26.38 11.16 -0.12 46.30 0.06 0.88 15.34
Austria 20.47 9.98 0.47 54.53 0.02 0.58 13.95
Belgium (Fl.) 18.30 17.01 0.05 54.99 -0.14 0.11 9.68
Czech Republic 23.13 10.46 0.01 53.53 0.17 0.35 12.34
Denmark 21.77 10.39 0.01 59.94 -0.02 0.10 7.82
England 19.54 9.03 0.23 55.23 0.29 0.86 14.82
Finland 20.49 13.26 0.04 51.44 0.10 0.27 14.41
France 20.47 13.80 0.00 60.20 0.12 0.04 5.37
Germany 22.23 11.72 0.09 52.88 -0.05 0.24 12.89
Greece 21.29 9.42 0.01 60.84 0.27 0.07 8.11
Hungary 22.42 10.11 0.05 49.84 -0.12 0.78 16.92
Iceland 20.17 8.63 0.15 61.15 0.08 0.06 9.76
Ireland 22.56 14.26 0.06 56.78 0.01 0.05 6.29
Italy 20.82 8.77 0.05 65.03 -0.04 0.04 5.33
Japan 19.11 12.01 0.01 65.77 0.07 0.12 2.91
Korea 25.40 10.13 0.07 52.04 0.06 0.17 12.13
Mexico 19.49 14.20 0.05 57.88 0.19 0.08 8.11
Netherlands 23.04 8.57 0.07 56.11 0.05 0.16 12.00
New Zealand 20.05 10.01 -0.02 59.56 0.12 0.22 10.05
Norway 22.20 8.16 0.00 65.35 -0.02 0.01 4.30
Portugal 22.99 12.26 0.02 57.47 0.10 0.08 7.08
Spain 21.12 8.36 0.01 64.99 0.03 0.05 5.44
Switzerland (Fr.) 24.05 7.11 0.06 48.92 0.14 0.66 19.07
Switzerland (Ger.) 22.93 8.67 0.35 55.60 0.14 0.42 11.90
Switzerland (Ital.) 20.09 7.18 0.00 54.24 0.15 1.02 17.33
Belgium (Fr.)* 24.13 10.15 0.06 57.55 -0.80 0.22 8.68
Brazil* 21.47 7.49 0.06 58.48 -1.18 0.10 13.57
Latvia* 16.74 7.33 0.22 55.63 -1.30 0.58 20.81
Luxembourg* 25.40 9.63 0.17 50.02 -1.55 0.79 15.54
Russian Federation* 31.16 8.83 1.61 47.67 -23.34 0.83 33.24
Scotland* 25.89 9.64 0.14 54.59 -3.48 0.00 13.22
Sweden* 21.12 6.25 0.92 51.20 -28.85 -0.12 49.49
United States* 34.29 6.96 0.73 47.51 -30.26 -1.37 42.13
* ρ3-coefficients were inadmissible (>1) in these countries.
Table 56: Variance Components for Retrieving Information

Country  Student (σ²α)  Item (σ²β)  Marker (σ²γ)  Student-Item Interaction (σ²αβ)  Student-Marker Interaction (σ²αγ)  Item-Marker Interaction (σ²βγ)  Measurement Error (σ²ε)
Australia 22.40 19.01 -0.02 50.36 0.01 0.22 8.01
Austria 14.97 23.60 0.00 54.93 -0.12 0.24 6.39
Belgium (Fl.) 11.89 22.59 0.15 50.42 0.10 1.43 13.42
Czech Republic 17.14 19.81 0.03 53.73 -0.23 0.46 9.05
Denmark 13.24 24.56 0.01 54.22 0.16 0.25 7.56
England 14.79 22.14 0.00 59.71 0.01 0.00 3.35
Finland 18.97 18.30 0.02 55.93 -0.11 0.07 6.81
France 19.52 19.37 0.37 52.32 1.01 0.11 7.30
Germany 19.28 21.95 0.00 54.03 -0.09 0.03 4.80
Greece 17.92 15.78 0.04 56.92 -0.26 0.16 9.45
Hungary 16.69 17.65 0.02 58.80 -0.01 0.19 6.65
Iceland 16.42 18.09 0.00 62.69 0.07 0.03 2.71
Ireland 16.62 20.04 0.08 53.30 0.08 0.88 9.01
Italy 9.22 22.47 0.04 60.33 -0.10 0.18 7.86
Japan 17.93 15.22 0.19 51.28 -0.08 0.54 14.92
Korea 11.45 23.50 0.02 55.59 -0.13 0.18 9.39
Mexico 20.34 18.28 0.07 54.53 -0.02 0.04 6.76
Netherlands 17.98 20.14 0.00 52.89 -0.81 0.26 9.54
New Zealand 20.29 19.62 0.04 51.82 0.09 0.19 7.94
Norway 15.66 17.79 0.00 61.43 0.21 0.17 4.74
Portugal 14.03 19.05 0.01 61.34 -0.03 0.12 5.48
Spain 18.43 15.34 0.56 42.02 0.10 2.62 20.93
Switzerland (Fr.) 17.17 20.42 0.12 52.69 0.04 0.30 9.25
Switzerland (Ger.) 16.43 23.52 0.02 50.02 0.00 1.32 8.69
Switzerland (Ital.) 17.46 17.29 0.04 55.23 0.27 0.39 9.32
Belgium (Fr.)* 35.88 21.17 2.94 38.23 -64.09 -1.95 67.83
Brazil* 21.59 19.19 1.16 49.15 -29.54 -0.37 38.82
Latvia* 27.99 9.17 1.12 51.86 -4.05 0.56 13.34
Luxembourg* 30.64 20.92 2.41 45.20 -57.98 -2.14 60.96
Russian Federation* 24.88 19.38 1.22 46.83 -28.09 0.28 35.51
Scotland* 16.70 18.53 0.10 58.15 -1.41 0.06 7.87
Sweden* 14.45 23.71 0.24 53.71 -10.53 0.41 18.00
United States* 29.73 21.35 1.09 42.19 -35.61 -1.23 42.48
* ρ3-coefficients were inadmissible (>1) in these countries.
Table 57: Variance Components for Interpreting Texts

Country  Student (σ²α)  Item (σ²β)  Marker (σ²γ)  Student-Item Interaction (σ²αβ)  Student-Marker Interaction (σ²αγ)  Item-Marker Interaction (σ²βγ)  Measurement Error (σ²ε)
Australia 16.46 22.63 0.10 47.56 0.09 0.35 12.80
Austria 15.57 16.43 0.09 52.88 -0.06 1.15 13.94
Belgium (Fl.) 14.90 23.80 0.04 45.18 0.09 0.81 15.18
Czech Republic 16.40 17.76 0.14 49.36 -0.01 1.18 15.16
Denmark 16.88 20.32 -0.02 54.40 0.00 0.28 8.15
England 11.41 27.78 0.02 54.11 -0.08 0.15 6.61
Finland 19.20 18.21 0.04 52.20 0.28 0.43 9.64
France 14.50 20.11 -0.02 53.75 -0.28 0.38 11.55
Germany 15.91 22.73 0.02 50.74 -0.22 0.27 10.55
Greece 13.80 22.63 -0.03 49.56 0.16 0.41 13.47
Hungary 15.62 16.96 0.16 54.81 -0.05 0.73 11.77
Iceland 14.56 22.13 0.05 56.53 -0.14 0.19 6.67
Ireland 12.95 19.45 0.21 50.44 0.07 1.41 15.47
Italy 9.94 22.70 0.07 57.23 -0.12 0.19 9.98
Japan 14.18 9.27 0.48 54.47 0.50 1.03 20.07
Korea 15.03 23.66 0.11 45.35 0.26 0.58 15.01
Mexico 16.29 22.00 0.05 51.38 0.13 0.18 9.97
Netherlands 19.30 15.15 0.02 52.76 -0.42 0.55 12.64
New Zealand 17.61 21.74 0.05 49.03 0.10 0.34 11.13
Norway 13.37 29.90 0.00 47.77 0.04 0.38 8.54
Portugal 13.00 22.16 0.04 56.14 0.23 0.16 8.27
Spain 16.07 16.36 0.83 41.68 0.56 1.42 23.09
Switzerland (Fr.) 15.65 18.91 0.04 50.29 0.22 0.71 14.18
Switzerland (Ger.) 17.49 17.64 0.10 49.75 -0.11 0.68 14.46
Switzerland (Ital.) 16.25 21.70 0.05 45.24 0.29 0.63 15.83
Belgium (Fr.)* 35.47 16.09 2.81 41.04 -62.53 -2.78 69.90
Brazil* 20.03 26.28 0.99 40.64 -29.96 0.26 41.77
Luxembourg* 31.46 17.93 2.84 44.04 -53.26 -1.79 58.78
Latvia* 23.28 14.08 0.46 45.50 -4.65 2.95 18.37
Russian Federation* 24.25 15.18 0.82 46.93 -25.58 0.52 37.89
Scotland* 14.04 17.09 0.12 57.51 -1.24 0.17 12.30
Sweden* 18.03 13.33 0.14 56.29 -10.16 -0.24 22.60
United States* 25.67 19.30 1.29 43.64 -36.52 -1.22 47.84
* ρ3-coefficients were inadmissible (>1) in these countries.
Table 58: Variance Components for Reflection and Evaluation
Country  Student (σ²α)  Item (σ²β)  Marker (σ²γ)  Student-Item Interaction (σ²αβ)  Student-Marker Interaction (σ²αγ)  Item-Marker Interaction (σ²βγ)  Measurement Error (σ²ε)
Australia 15.53 25.75 0.41 35.55 0.31 0.75 21.69
Austria 14.90 19.96 0.81 37.20 0.57 1.44 25.13
Belgium (Fl.) 16.26 20.75 0.35 36.20 0.41 1.26 24.77
Czech Republic 15.65 20.78 1.07 32.83 0.69 1.96 27.02
Denmark 14.83 21.54 0.07 49.36 0.05 0.53 13.61
England 13.77 25.57 0.05 47.06 0.02 0.13 13.41
Finland 23.09 15.18 0.21 48.51 0.24 0.27 12.50
France 16.93 21.29 0.40 40.52 0.56 0.52 19.78
Germany 18.32 25.33 0.19 37.09 0.06 0.65 18.35
Greece 13.75 26.20 0.52 32.21 0.41 1.34 25.57
Hungary 15.60 20.52 0.66 39.65 0.16 1.25 22.15
Iceland 16.64 22.44 0.07 49.23 0.01 0.40 11.20
Ireland 14.24 18.67 0.96 41.43 0.34 1.51 22.85
Italy 10.64 26.50 0.50 42.79 0.57 0.99 18.02
Japan 15.13 10.34 1.57 36.68 1.76 1.79 32.73
Korea 11.70 25.36 0.68 29.10 0.78 2.57 29.82
Mexico 18.52 18.75 0.14 40.50 -0.74 0.88 21.95
Netherlands 17.91 24.33 0.37 37.62 0.05 0.45 19.26
New Zealand 14.06 21.79 0.57 43.90 0.50 0.80 18.37
Norway 16.67 21.93 0.28 37.37 -0.23 1.11 22.86
Portugal 18.63 24.58 0.24 34.85 0.50 1.12 20.08
Russian Federation 13.18 25.07 0.07 41.51 0.27 0.59 19.31
Spain 14.97 16.35 1.02 37.29 1.03 1.85 27.50
Switzerland (Fr.) 17.57 19.11 0.77 35.08 0.47 1.86 25.14
Switzerland (Ger.) 17.60 19.74 0.77 37.12 0.61 1.14 23.02
Switzerland (Ital.) 15.49 25.37 0.29 34.01 0.56 1.24 23.04
Belgium (Fr.)* 16.39 17.60 1.03 47.94 -7.09 0.58 23.55
Brazil* 32.02 20.54 2.15 36.32 -45.99 -1.72 56.68
Latvia* 27.95 17.69 2.58 42.76 -40.51 -1.18 50.71
Luxembourg* 21.17 17.66 1.16 34.52 -22.78 0.92 47.35
Scotland* 25.18 14.56 1.02 37.26 -2.91 1.08 23.81
Sweden* 21.58 20.05 1.21 35.26 -18.97 1.14 39.73
United States* 24.39 26.13 0.97 31.25 -28.29 -0.71 46.26
* ρ3-coefficients were inadmissible (>1) in these countries.
Table 59: ρ3-Estimates for Mathematics

Country                I = 8           I = 16          I = 24
                   M = 1   M = 2   M = 1   M = 2   M = 1   M = 2
Australia 0.971 0.986 0.982 0.991 0.986 0.993
Austria 0.935 0.966 0.951 0.975 0.958 0.979
Belgium (Fl.) 0.976 0.988 0.985 0.992 0.988 0.994
Czech Republic 0.978 0.989 0.987 0.994 0.991 0.996
Denmark 0.975 0.987 0.986 0.993 0.990 0.995
England 0.986 0.993 0.991 0.995 0.993 0.996
Finland 0.985 0.992 0.991 0.995 0.993 0.997
France 0.961 0.980 0.976 0.988 0.981 0.991
Germany 0.971 0.985 0.984 0.992 0.989 0.994
Greece 0.981 0.990 0.988 0.994 0.991 0.996
Hungary 0.982 0.991 0.990 0.995 0.993 0.996
Ireland 0.951 0.975 0.967 0.983 0.973 0.986
Iceland 0.978 0.989 0.985 0.992 0.988 0.994
Italy 0.968 0.984 0.979 0.990 0.984 0.992
Japan 0.991 0.996 0.995 0.997 0.996 0.998
Korea 0.976 0.988 0.986 0.993 0.990 0.995
Mexico 0.909 0.952 0.926 0.962 0.934 0.966
Netherlands 0.963 0.981 0.977 0.988 0.982 0.991
New Zealand 0.967 0.983 0.979 0.989 0.984 0.992
Norway 0.972 0.986 0.981 0.990 0.985 0.992
Portugal 0.986 0.993 0.990 0.995 0.992 0.996
Russian Federation 0.981 0.991 0.989 0.994 0.992 0.996
Spain 0.958 0.979 0.970 0.985 0.975 0.987
Sweden 0.974 0.987 0.984 0.992 0.987 0.994
Switzerland (Fr.) 0.946 0.972 0.963 0.981 0.969 0.984
Switzerland (Ger.) 0.985 0.993 0.991 0.995 0.993 0.996
Switzerland (Ital.) 0.922 0.960 0.936 0.967 0.941 0.970
Table 60: ρ3-Estimates for Science

Country                I = 8           I = 16          I = 24
                   M = 1   M = 2   M = 1   M = 2   M = 1   M = 2
Australia 0.939 0.969 0.965 0.982 0.975 0.987
Austria 0.922 0.959 0.945 0.972 0.954 0.976
Belgium (Fl.) 0.952 0.975 0.970 0.985 0.978 0.989
Switzerland (Fr.) 0.917 0.957 0.948 0.973 0.961 0.980
Switzerland (Ger.) 0.919 0.958 0.950 0.974 0.962 0.981
Czech Republic 0.936 0.967 0.954 0.977 0.962 0.981
Denmark 0.943 0.971 0.966 0.982 0.975 0.987
England 0.967 0.983 0.981 0.990 0.986 0.993
Finland 0.976 0.988 0.985 0.992 0.989 0.994
France 0.932 0.965 0.957 0.978 0.968 0.984
Germany 0.944 0.971 0.965 0.982 0.973 0.986
Greece 0.972 0.986 0.981 0.991 0.985 0.993
Hungary 0.957 0.978 0.969 0.984 0.975 0.987
Iceland 0.972 0.986 0.982 0.991 0.987 0.993
Ireland 0.927 0.962 0.957 0.978 0.969 0.984
Italy 0.950 0.974 0.966 0.983 0.973 0.986
Japan 0.976 0.988 0.985 0.992 0.988 0.994
Korea 0.983 0.992 0.989 0.994 0.991 0.995
Netherlands 0.950 0.975 0.970 0.985 0.977 0.988
New Zealand 0.948 0.973 0.968 0.984 0.976 0.988
Norway 0.955 0.977 0.968 0.984 0.974 0.987
Portugal 0.983 0.991 0.990 0.995 0.993 0.996
Russian Federation 0.951 0.975 0.969 0.984 0.976 0.988
Spain 0.914 0.955 0.939 0.968 0.949 0.974
Sweden 0.967 0.983 0.979 0.989 0.984 0.992
Table 61: ρ3-Estimates for Retrieving Information

Country                I = 8           I = 16          I = 24
                   M = 1   M = 2   M = 1   M = 2   M = 1   M = 2
Australia 0.965 0.982 0.980 0.990 0.986 0.993
Austria 0.963 0.981 0.978 0.989 0.984 0.992
Belgium (Fl.) 0.896 0.945 0.927 0.962 0.942 0.970
Denmark 0.951 0.975 0.970 0.985 0.978 0.989
England 0.977 0.989 0.987 0.993 0.991 0.995
Finland 0.981 0.990 0.988 0.994 0.991 0.996
France 0.868 0.929 0.908 0.952 0.925 0.961
Germany 0.947 0.973 0.968 0.984 0.977 0.988
Greece 0.967 0.983 0.980 0.990 0.986 0.993
Hungary 0.919 0.958 0.925 0.961 0.928 0.963
Iceland 0.965 0.982 0.978 0.989 0.984 0.992
Ireland 0.953 0.976 0.971 0.985 0.979 0.989
Italy 0.943 0.971 0.962 0.981 0.971 0.985
Japan 0.983 0.992 0.988 0.994 0.990 0.995
Korea 0.941 0.970 0.960 0.980 0.969 0.984
Mexico 0.920 0.958 0.948 0.973 0.960 0.980
Netherlands 0.938 0.968 0.960 0.980 0.970 0.985
New Zealand 0.967 0.983 0.980 0.990 0.985 0.992
Russian Federation 0.966 0.983 0.974 0.987 0.978 0.989
Scotland 0.959 0.979 0.974 0.987 0.980 0.990
Spain 0.946 0.972 0.962 0.981 0.969 0.984
Sweden 0.968 0.984 0.980 0.990 0.986 0.993
Switzerland (Fr.) 0.941 0.970 0.958 0.979 0.965 0.982
Switzerland (Ger.) 0.946 0.972 0.964 0.982 0.972 0.986
Table 62: ρ3-Estimates for Interpreting Texts

Country                I = 8           I = 16          I = 24
                   M = 1   M = 2   M = 1   M = 2   M = 1   M = 2
Australia 0.924 0.961 0.951 0.975 0.962 0.980
Austria 0.918 0.957 0.948 0.973 0.961 0.980
Belgium (Fl.) 0.906 0.951 0.940 0.969 0.955 0.977
Switzerland (Fr.) 0.901 0.948 0.933 0.965 0.946 0.972
Switzerland (Ger.) 0.912 0.954 0.940 0.969 0.953 0.976
Denmark 0.912 0.954 0.944 0.971 0.957 0.978
England 0.942 0.970 0.965 0.982 0.975 0.987
Finland 0.955 0.977 0.971 0.985 0.978 0.989
France 0.827 0.905 0.865 0.927 0.881 0.937
Germany 0.922 0.960 0.952 0.975 0.964 0.982
Greece 0.942 0.970 0.959 0.979 0.967 0.983
Hungary 0.934 0.966 0.960 0.980 0.971 0.985
Ireland 0.913 0.955 0.943 0.970 0.956 0.977
Iceland 0.929 0.963 0.953 0.976 0.963 0.981
Italy 0.890 0.942 0.923 0.960 0.939 0.968
Japan 0.960 0.979 0.974 0.987 0.981 0.990
Korea 0.927 0.962 0.950 0.975 0.961 0.980
Mexico 0.853 0.921 0.884 0.939 0.898 0.947
Netherlands 0.899 0.947 0.930 0.964 0.943 0.971
New Zealand 0.940 0.969 0.960 0.980 0.968 0.984
Portugal 0.939 0.969 0.964 0.982 0.974 0.987
Russian Federation 0.944 0.971 0.965 0.982 0.974 0.987
Scotland 0.937 0.968 0.960 0.979 0.969 0.984
Spain 0.957 0.978 0.975 0.987 0.982 0.990
Sweden 0.938 0.968 0.954 0.976 0.961 0.980
Table 63: ρ3-Estimates for Reflection and Evaluation

Country               I = 8            I = 16           I = 24
                      M = 1    M = 2   M = 1    M = 2   M = 1    M = 2
Australia 0.850 0.919 0.893 0.944 0.911 0.954
Austria 0.806 0.893 0.850 0.919 0.869 0.930
Belgium (Fl.) 0.838 0.912 0.886 0.939 0.906 0.951
Denmark 0.786 0.880 0.832 0.908 0.852 0.920
England 0.897 0.946 0.935 0.966 0.950 0.974
Finland 0.918 0.957 0.948 0.973 0.961 0.980
France 0.774 0.873 0.817 0.899 0.835 0.910
Germany 0.835 0.910 0.873 0.932 0.889 0.941
Greece 0.934 0.966 0.954 0.977 0.962 0.981
Hungary 0.863 0.926 0.897 0.946 0.912 0.954
Iceland 0.846 0.917 0.888 0.941 0.906 0.951
Ireland 0.805 0.892 0.858 0.923 0.880 0.936
Italy 0.817 0.899 0.856 0.923 0.873 0.932
Japan 0.937 0.968 0.961 0.980 0.971 0.985
Korea 0.823 0.903 0.855 0.922 0.870 0.930
Mexico 0.721 0.838 0.760 0.864 0.777 0.875
Netherlands 0.736 0.848 0.795 0.886 0.821 0.902
New Zealand 0.887 0.940 0.925 0.961 0.940 0.969
Norway 0.913 0.954 0.962 0.981 0.983 0.991
Portugal 0.867 0.929 0.914 0.955 0.934 0.966
Russian Federation 0.849 0.919 0.881 0.937 0.895 0.944
Scotland 0.871 0.931 0.910 0.953 0.925 0.961
Spain 0.918 0.957 0.947 0.973 0.960 0.979
Sweden 0.867 0.929 0.909 0.952 0.927 0.962
Switzerland (Fr.) 0.826 0.905 0.871 0.931 0.889 0.942
Switzerland (Ital.) 0.836 0.910 0.882 0.937 0.901 0.948
Inter-country Rater Reliability Study - Outcome
Aletta Grisay
Overview of the Main Results
Approximately 1 600 booklets were submitted for the inter-country rater reliability study, in which 41 796 student answers were re-scored by the verifiers. In about 78 per cent of these cases, both the verifier and all four national markers agreed on an identical score, and in another 8.5 per cent of the cases, the verifier agreed with the majority of the national markers.
Of the remaining cases, 5 per cent had national marks considered too inconsistent to allow comparison with the verifier's mark, and 8 per cent were flagged and submitted to Consortium staff for adjudication. In approximately 5 per cent of these cases, the adjudicators found that there was no real problem (the marks given by the national markers were correct), while 2.5 per cent of the marks were found to be too lenient and 1 per cent too harsh.
Hence, a very high proportion of the cases (91.5 per cent) showed substantial consistency with the scoring instructions described in the Marking Guides. Altogether, only 3.5 per cent of cases were identified as either too lenient or too harsh, and 5 per cent were identified as inconsistent marks. These percentages are summarised in Table 64.
Table 64: Summary of Inter-Country Rater Reliability Study
Category Percentage
Agreement 91.5
Inconsistency 5.0
Harshness 1.0
Leniency 2.5
Note: Total cases = 41 796
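The agreement figure in Table 64 is the sum of three percentages quoted in the narrative above: exact agreement, majority agreement, and flagged cases that the adjudicators found unproblematic. A minimal sketch of that bookkeeping, using only the rounded values quoted in the text:

```python
# Aggregation behind Table 64's 'Agreement' row, using the rounded
# percentages quoted in the text (no new data).
exact_agreement = 78.0     # verifier and all four national markers identical
majority_agreement = 8.5   # verifier agreed with the majority of markers
adjudicated_ok = 5.0       # flagged cases the adjudicators found unproblematic

agreement = exact_agreement + majority_agreement + adjudicated_ok  # 91.5

# The four Table 64 categories account for all re-scored answers.
inconsistency, harshness, leniency = 5.0, 1.0, 2.5
assert agreement + inconsistency + harshness + leniency == 100.0
```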
While relatively infrequent, the inconsistent or biased marks were not uniformly distributed across items or countries, suggesting that scoring instructions may have been insufficiently stringent for some items, and that some national scoring teams may have been less careful than others.
Main Results by Item
The qualitative analysis of the most frequent cases of marking errors may indicate some problems in the scoring instructions that shed light on the higher rate of inconsistencies observed for these items. For example:

• Scoring instructions that were too complex. For many items where a number of criteria had to be taken into account simultaneously when deciding on the code, markers often tended to forget one or more of them. For example, in R120Q06: Student Opinions, the response (i) must contain some clear reference to the selected author (Ana, Beatrice, Dieter, or Felix); (ii) must contain details showing that the student understood the author's specific argument(s); (iii) must indicate how these arguments agreed with the student's own opinions about space exploration; and (iv) must be expressed in the student's own words. National markers often gave answers full credit if they commented on Ana's concerns about famine and disease or on Dieter's concerns about the environment, but made no explicit or implicit reference to space exploration. They also tended to give full credit to answers that did not sufficiently discriminate among two or more of the five authors, or were not expressed in the student's own words (i.e., that were mere quotations).

• Lack of precision in the scoring instructions. For example, in R119Q05: Gift, neither the scoring instructions nor the examples sufficiently explained the criteria to be applied to decide whether the student had actually commented on the last sentence, the last paragraph, or the end of the story as a whole. Many answers that did not refer to any specific aspect described in the last sentence were given full or partial credit by the national markers, provided that some reference to the 'ending' was included.

• Many errors occurred in the marks given to responses where the code depended on the response given to a previous item (e.g., R099Q4B: Plan International).

• Cultural differences could explain why a number of national markers tended to be reluctant to accept the understanding of the text that was sometimes conveyed by the Marking Guide. In R119Q07: Gift, for example, the idea that the cries of the panther were mentioned by the author 'to create fear' was often accepted, while this interpretation was considered in the Marking Guide to be a clear case of code 0. In R119Q08: Gift, the national markers often gave credit to answers suggesting that the woman fed the panther to get rid of it or to avoid being eaten herself, while the Marking Guide specified that pity or empathy were the only motivations to be retained for full credit.

Table 65 presents detailed results by item.
Nine of the 26 items, italicised in Table 65, had total agreement of less than 90 per cent. Table 66 gives the stem for each of these items.
Table 65: Summary of Item Characteristics for Inter-Country Rater Reliability Study
Percentage
Item Frequency Agreement Inconsistency Harshness Leniency
R070Q04 1618 89.7 6.5 1.5 2.2
R081Q02 1618 96.4 2.0 0.8 0.8
R081Q05 1618 88.5 6.9 1.5 3.2
R081Q06A 1617 87.8 8.1 1.7 2.5
R081Q06B 1618 87.0 8.6 2.0 2.4
R091Q05 1618 99.4 0.3 0.1 0.2
R091Q07B 1570 90.2 5.7 2.1 2.0
R099Q04B 1618 75.7 13.2 3.3 7.7
R110Q04 1618 98.1 1.2 0.3 0.4
R110Q05 1618 95.1 3.2 0.1 1.7
R119Q05 1617 78.3 13.6 3.2 4.9
R119Q07 1618 79.1 12.9 2.8 5.2
R119Q08 1618 85.2 8.5 1.9 4.4
R119Q09T 1618 90.5 5.1 2.1 2.3
R120Q06 1594 86.4 8.8 1.4 3.5
R219Q01E 1618 94.8 3.3 1.2 0.7
R219Q02 1617 95.7 2.7 0.6 1.1
R220Q01 1618 92.3 4.5 1.6 1.6
R236Q01 1569 92.8 2.1 0.7 4.4
R236Q02 1569 93.0 4.0 0.8 2.2
R237Q01 1618 97.2 1.2 0.4 1.2
R237Q03 1618 92.8 4.3 0.5 2.3
R239Q01 1569 99.0 0.6 0.1 0.3
R239Q02 1569 97.9 1.2 0.2 0.7
R246Q01 1617 98.2 0.9 0.4 0.6
R246Q02 1618 92.8 4.1 0.2 3.0
Note: Items in italics have a total agreement of less than 90 per cent.
An analysis was conducted to investigate whether the items with major marking problems in a specific country showed particularly poor statistics for that country in the item analysis. Table 67 presents the main inter-country reliability results for the group of items where more than 25 per cent marking inconsistencies were found in at least one country, together with four main study indicators:

• the item fit for that country;
• whether the item difficulty estimate was higher or lower than expected (0: no; 1: yes); and
• whether the item had a problem with score order, i.e., cases where ability appeared to be lower for higher scores (0: no; 1: yes).

Many items for which the inter-country reliability study identified problems in the national marking proved to have large fit indices (r = 0.40 between fit and per cent of marking inconsistencies, for N = 26 items × 27 national samples = 702 cases). A moderately high correlation was also observed between the percentage of inconsistencies and problems in the category order (r = 0.25).

One might also expect a significant correlation between the percentage of too-harsh or too-lenient codes awarded by national markers and possible differential item functioning. Actually, the correlations, though significant, were quite low (r = 0.09 between FL_HARSH and HIGH_DELTA, and r = 0.17 between FL_LENIE and LOW_DELTA). This may be due to the fact that in many countries, when an item showed a high percentage of too-lenient codes, it often also showed non-negligible occurrences of too-harsh codes.
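The r values quoted here are ordinary Pearson product-moment correlations over the item-by-country cases. A minimal sketch of the computation; the short vectors below are invented toy data, not the study's:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Toy illustration: item fit mean squares vs. per cent marking inconsistency.
fit = [0.95, 1.02, 1.15, 1.28, 1.33]
inconsistency = [2.0, 4.5, 13.2, 12.9, 29.1]
r = pearson_r(fit, inconsistency)  # positive, as found in the study
```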
Main Results by Country
Table 68 presents the detail of the results by country. In most countries (23 of 31), the international verifier or the adjudicator agreed with the national marks in more than 90 per cent of cases. The remaining eight countries (those with less than 90 per cent of the cases in the agreement category) are italicised in the table.

Four of these eight countries had quite a few cases in the 'too inconsistent' category, but showed no specific trend towards harshness or leniency.
Table 66: Items with Less than 90 Per Cent Consistency
Item Stem
R070Q04: Beach                 Does the title of the article give a good summary of the debate presented in the article? Explain your answer.
R081Q05: Graffiti              Why does Sophia refer to advertising?
R081Q06A: Graffiti             Which of the two letter writers do you agree with? Explain your answer by using your own words to refer to what is said in one or both of the letters.
R081Q06B: Graffiti             Which do you think is the better letter? Explain your answer by referring to the way one or both letters are written.
R099Q04B: Plan International   What do you think might explain the level of Plan International's activities in Ethiopia compared with its activities in other countries?
R119Q05: Gift                  Do you think the last sentence of "The Gift" is an appropriate ending?
R119Q07: Gift                  Why do you think the writer chooses to introduce the panther with these descriptions?
R119Q08: Gift                  What does the story suggest was the woman's reason for feeding the panther?
R120Q06: Student Opinions Which student do you agree with most strongly?
Table 67: Inter-Country Reliability Study Items with More than 25 Per Cent Inconsistency Rate in at Least One Country

Country    Item    Inconsistency (%)    Harshness (%)    Leniency (%)    Fit Mean Square    High Fit    Low Fit    Abilities Not Ordered
Austria R081Q06A 29.2 4.2 0.0 0.99 0 0 0
Canada (Fr.) R081Q06B 25.5 0.0 2.1 0.94 0 0 0
Canada (Fr.) R119Q07 29.8 0.0 2.1 1.36 0 0 1
Czech Republic R099Q04B 33.4 4.2 2.1 1.15 0 1 1
Czech Republic R119Q05 41.7 4.2 8.3 1.15 0 0 0
Czech Republic R119Q07 52.1 2.1 4.2 1.28 0 0 1
Denmark R070Q04 37.5 0.0 8.3 1.02 0 0 0
Denmark R099Q04B 62.6 0.0 18.8 1.19 0 0 0
Denmark R119Q05 27.1 0.0 10.4 1.24 0 0 0
Greece R099Q04B 35.5 2.1 14.6 1.32 0 1 0
Iceland R236Q01 62.5 0.0 62.5 0.97 0 1 0
Ireland R119Q05 39.6 0.0 2.1 1.22 0 0 1
Japan R099Q04B 35.4 4.2 22.9 1.19 0 0 0
Japan R119Q07 33.3 0.0 12.5 1.20 1 0 0
Korea R099Q04B 45.9 6.3 31.3 1.20 0 1 0
Korea R119Q05 37.5 8.3 16.7 1.04 1 0 0
Korea R119Q07 29.3 4.2 18.8 1.16 1 0 0
Mexico R091Q07B 39.6 22.9 12.5 1.17 1 0 0
Mexico R099Q04B 31.2 2.1 20.8 1.21 0 1 0
Netherlands R119Q08 39.6 0.0 12.5 1.28 0 0 1
Portugal R119Q05 25.1 2.1 4.2 1.23 0 0 0
Russian Federation R099Q04B 43.8 4.2 33.3 1.18 0 1 0
Russian Federation R119Q05 37.5 2.1 27.1 1.25 0 1 0
Russian Federation R119Q07 31.3 4.2 20.8 1.16 0 1 0
Sweden R081Q06B 27.2 6.3 4.2 0.96 1 0 0
United States R119Q07 29.1 8.3 8.3 1.33 0 1 1
However, the rate of 'too lenient' cases was 5 per cent or more in four countries: France (5 per cent), Korea (5 per cent), the Russian Federation (7 per cent) and Latvia (9 per cent). The details provided by the country reports seem to suggest that, in all four countries, scoring instructions were not strictly followed for a number of items. In the Russian Federation, for five of the 26 items in Booklet 7, more than 10 per cent of cases were marked too leniently. In Korea, more than 10 per cent of cases for one item were marked too harshly, while six other items were too leniently marked. In Latvia, four items received very systematic lenient marks (between one-third and one-half of student answers were marked too leniently for these items), while a few others had high proportions of cases which were either too harshly or too leniently marked.

In both Russia and Latvia, there was some indication that, for a number of partial credit items, the markers may have used codes 0, 1, 2 and 3 in a non-standard way. They gave code 3 to what they thought were the 'best' answers, 2 to answers that seemed 'good, but not perfect', 1 to 'poor, but not totally wrong' answers, and 0 to 'totally wrong' answers, without considering the specific scoring rules for each code provided in the Marking Guide. In Latvia, the NPM had decided not to translate the Marking Guide into Latvian, on the assumption that all the markers were sufficiently proficient in English to use the English source version.
Ten other countries had leniency problems for just one or two items, sometimes due to residual translation errors. In some of these cases, the proportion of incorrect marks may have been sufficient to affect the item functioning.
In general, harshness problems were veryinfrequent.
Table 68: Inter-Country Summary by Country
Percentage
Country Frequency Consistent Inconsistent Harsh Lenient
Australia 1 248 92.4 5.6 1.4 0.6
Austria 1 247 91.6 6.4 1.1 0.9
Belgium (Fl.) 1 078 85.6 9.6 1.8 3.0
Belgium (Fr.) 1 248 92.2 4.9 1.6 1.3
Brazil 1 248 91.5 4.6 0.6 3.4
Canada (Eng.) 1 248 91.4 6.7 0.7 1.2
Canada (Fr.) 1 222 91.0 6.9 0.6 1.6
Czech Republic 1 248 86.9 9.5 0.6 3.0
Denmark 1 248 86.8 9.0 0.5 3.8
England 1 248 93.3 5.0 0.4 1.3
Finland 1 248 96.1 2.4 0.6 1.0
France 1 196 80.2 12.5 2.3 5.0
Germany 1 224 92.6 5.8 0.7 0.8
Greece 1 247 90.5 3.3 1.7 4.6
Hungary 1 274 92.9 2.3 2.4 2.4
Iceland 1 248 91.7 5.4 0.3 2.6
Ireland 1 248 92.5 5.0 0.6 1.8
Italy 1 170 89.8 7.0 1.5 1.7
Japan 1 248 95.0 1.8 0.7 2.5
Korea 1 248 89.3 3.9 1.6 5.1
Luxembourg 1 144 95.9 2.1 1.6 0.4
Latvia 1 066 80.6 7.5 2.5 9.4
Mexico 1 247 91.1 3.5 2.8 2.6
Netherlands 1 248 91.3 5.8 1.4 1.5
New Zealand 1 222 96.5 2.7 0.1 0.7
Norway 1 248 92.9 4.4 0.6 2.1
Poland 1 248 92.1 3.0 1.0 3.8
Portugal 1 248 92.0 5.4 1.8 0.8
Russian Federation 1 248 88.5 3.7 0.9 6.9
Scotland 1 248 94.5 3.8 0.6 1.1
Spain 1 248 94.1 3.4 0.8 1.8
Sweden 1 200 93.2 4.0 1.6 1.3
Switzerland 1 300 93.4 4.0 2.4 0.2
United States 1 247 92.1 5.2 1.4 1.3
Note: Countries with less than 90 per cent of cases in the agreement category are in italics.
Chapter 15

Data Adjudication
Ray Adams, Keith Rust and Christian Monseur
In January 2001, the PISA Consortium, the Sampling Referee and the OECD Secretariat met to review the implementation of the survey in each participating country and to consider:

• the extent to which the country had met PISA sampling standards;
• the outcomes of the national centre and school quality monitoring visits;
• the quality and completeness of the submitted data;
• the outcomes of the inter-country reliability study; and
• the outcomes of the translation verification process.

The meeting led to a set of recommendations to the OECD Secretariat about the suitability of the data from each country for reporting and use in analyses of various kinds, and a set of follow-up actions with a small number of countries.
This chapter reviews the extent to which each country met the PISA 2000 standards. For each country, a recommendation was made on the use of the data in international analyses and reports. Any follow-up data analyses that needed to be undertaken to allow the Consortium to endorse the data were also noted.
Overview of Response Rate Issues
Table 69 presents the results of some simple modelling that was undertaken to consider the impact of non-response on estimates of means. If one assumes a correlation between a school's propensity to participate in PISA and student achievement in that school, the bias in mean scores resulting from various levels of non-response can be computed as a function of the strength of the relationship between the propensity to non-response and proficiency.1
The rows of Table 69 correspond to differing response rates, and the columns correspond to differing correlations between propensity and proficiency. The values give the mean scores that would be observed for samples with matching characteristics.
The true mean and standard deviation were assumed to be 500 and 100 respectively. In PISA, the typical standard error for a sample mean is 3.5. This means that any value in the table of approximately 506 or greater reflects a bias larger than a difference from the true value that would be considered statistically significant. Those values are shaded in Table 69.
While the true magnitude of the correlation between propensity and achievement is not known, standards can be set to guard against potential bias from non-response. The PISA school response rate standard of 0.85 was chosen to protect against any substantial relationship between propensity and achievement.
1 The values reported in Table 69 were computed using a model that assumes a bivariate normal distribution for propensity to respond and proficiency. The expected means were computed under the assumption that proficiency is observed only for schools whose propensity to non-response is below a value that yields the specified response rate.
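The values in Table 69 can be reproduced from the model described in this footnote: with propensity and proficiency bivariate normal with correlation ρ, the observed mean exceeds the true mean by σρ times the inverse Mills ratio at the cutoff implied by the response rate. A sketch using only the Python standard library (the function name is ours):

```python
from statistics import NormalDist

def expected_observed_mean(response_rate, rho, mu=500.0, sigma=100.0):
    """Mean proficiency observed when only schools whose (standard normal)
    response propensity exceeds the cutoff implied by `response_rate`
    respond, with propensity and proficiency bivariate normal with
    correlation `rho`."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - response_rate)   # propensity cutoff
    mills = nd.pdf(z) / response_rate     # phi(z) / P(propensity > z)
    return mu + sigma * rho * mills
```

For example, an 85 per cent response rate with ρ = 0.3 gives about 508, and a 50 per cent rate with ρ = 1.0 gives about 580, matching the corresponding cells of Table 69.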
Figure 30 is a plot of the attained PISA school response rates. It shows a scatter-plot of the school response rates (weighted) that were attained by the PISA participants. Those countries that are plotted in the shaded region were regarded as fully satisfying the PISA school response rate criterion.

As discussed below, each of the six countries (Belgium, the Netherlands, New Zealand, Poland, the United Kingdom and the United States) that did not meet the PISA criterion was asked to provide supplementary evidence that would assist the Consortium in making a balanced judgement about the threat of non-response to the accuracy of inferences which could be made from the PISA data.

The strength of the evidence that the Consortium required to consider a country's sample as sufficiently representative is directly related to the distance of the achieved response rate from the shaded region in the figure.
Table 69: The Potential Impact of Non-Response on PISA Proficiency Estimates
                 Correlation Between Propensity and Proficiency
Response Rate    0.08   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
0.05 517 521 541 562 583 603 624 644 665 686 706
0.10 514 518 535 553 570 588 605 623 640 658 675
0.15 512 516 531 547 562 578 593 609 624 640 655
0.20 511 514 528 542 556 570 584 598 612 626 640
0.25 510 513 525 538 551 564 576 589 602 614 627
0.30 509 512 523 535 546 558 570 581 593 604 616
0.35 508 511 521 532 542 553 563 574 585 595 606
0.40 508 510 519 529 539 548 558 568 577 587 597
0.45 507 509 518 526 535 544 553 562 570 579 588
0.50 506 508 516 524 532 540 548 556 564 572 580
0.55 506 507 514 522 529 536 543 550 558 565 572
0.60 505 506 513 519 526 532 539 545 552 558 564
0.65 505 506 511 517 523 528 534 540 546 551 557
0.70 504 505 510 515 520 525 530 535 540 545 550
0.75 503 504 508 513 517 521 525 530 534 538 542
0.80 503 503 507 510 514 517 521 524 528 531 535
0.85 502 503 505 508 511 514 516 519 522 525 527
0.90 502 502 504 506 508 510 512 514 516 518 519
0.95 501 501 502 503 504 505 507 508 509 510 511
0.99 500 500 501 501 501 501 502 502 502 502 503
Follow-up Data Analysis Approach
After reviewing the sampling outcomes, the Consortium asked countries falling below the PISA school response rate standards to provide additional data or evidence to determine whether there was a potential bias. Before describing the general statistical approach that the Consortium adopted, it is worth noting that the existence of a potential bias can only be assessed for the variables used in the analyses; this does not exclude the possibility that the sample is biased with respect to another variable.
In most cases, the Consortium worked with national centres to undertake a range of logistic regression analyses. Logistic regression is widely used for selecting models for variables with only two outcomes. In the analysis recommended by the Consortium, the dependent variable is the response or non-response of a particular sampled school. Suppose, for simplicity's sake, that only two characteristics are known for the school: region and urban status. Let Y_ijk be the response outcome for the kth school in region i (i = 1, ..., 4) with urban status j (j = 0, 1). Then define:

    Y_ijk = 1 if school k participates in the survey,
    Y_ijk = 0 if school k does not participate in the survey.

Under the quasi-randomisation model (see, for example, Oh and Scheuren, 1983), response is modelled as another phase of sampling, assuming that the Y_ijk are independent random variables, with probability p_ijk for outcome 1 and probability 1 - p_ijk for outcome 0. Here p_ijk is the school's probability of response, or its response propensity.

The logistic regression model goes a step beyond this to model p_ijk, and is specified as follows:

    ln( p_ijk / (1 - p_ijk) ) = μ + α_i + β_j ,

which is algebraically equivalent to:

    p_ijk = exp(μ + α_i + β_j) / (1 + exp(μ + α_i + β_j)) .
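A sketch of fitting this logistic model by maximum likelihood, using plain gradient ascent on simulated school-response data (the school counts and parameter values are invented for illustration, not PISA's):

```python
import math
import random

random.seed(1)

# Simulate participation for 2 000 hypothetical schools in 4 regions and
# 2 urban strata under ln(p/(1-p)) = mu + alpha_i + beta_j.
TRUE_MU, TRUE_ALPHA, TRUE_BETA = 1.0, [0.0, 0.4, -0.3, 0.1], [0.0, -0.5]
schools = []
for _ in range(2000):
    i, j = random.randrange(4), random.randrange(2)
    eta = TRUE_MU + TRUE_ALPHA[i] + TRUE_BETA[j]
    p = 1.0 / (1.0 + math.exp(-eta))
    schools.append((i, j, 1 if random.random() < p else 0))

# Maximum likelihood by gradient ascent on the log-likelihood;
# alpha_0 and beta_0 are fixed at 0 for identifiability.
mu, alpha, beta = 0.0, [0.0] * 4, [0.0] * 2
for _ in range(1000):
    g_mu, g_a, g_b = 0.0, [0.0] * 4, [0.0] * 2
    for i, j, y in schools:
        p = 1.0 / (1.0 + math.exp(-(mu + alpha[i] + beta[j])))
        resid = y - p              # score contribution d(logL)/d(eta)
        g_mu += resid
        g_a[i] += resid
        g_b[j] += resid
    step = 1.0 / len(schools)
    mu += step * g_mu
    for k in (1, 2, 3):
        alpha[k] += step * g_a[k]
    beta[1] += step * g_b[1]

# A substantial fitted alpha_i or beta_j flags that stratum as
# associated with non-response.
```

In practice such models would be fitted with standard statistical software; the hand-rolled ascent above only makes the likelihood machinery explicit.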
Figure 30: Plot of Attained PISA School Response Rates

[Scatter plot of weighted school response rates, in per cent: school response rate before replacement (horizontal axis, 0 to 100) against school response rate after replacement (vertical axis, 0 to 100). The countries plotted outside the shaded acceptance region are the United States, the United Kingdom, the Netherlands, Poland, Belgium and New Zealand.]
The quantity p_ijk / (1 - p_ijk) is the odds that the school participates. Taking the logarithm gives a multiplicative rather than an additive model, which is almost universally preferred by practitioners (see, for example, McCullagh and Nelder, 1983). The quantities on the right-hand side of the equation define parameters that are then estimated using maximum likelihood methods.

There is concern that non-response bias may be a significant problem if any of the α_i or β_j terms is substantial, since this indicates that region and/or urban status are associated with non-response.
Countries occasionally also provided otherstatistical analyses in addition to or in place ofthe logistic regression analysis.
Review of Compliance with Other Standards
It is important to recognise that the PISA data adjudication is a late but not necessarily final step in a quality assurance process. By the time each country was adjudicated, quality assurance mechanisms (such as the sampling procedures documentation, translation verification, data cleaning and site visits) had identified a range of issues and ensured that they had been rectified, at least in the majority of cases.

Details on the various quality assurance procedures and their outcomes are documented elsewhere (see Chapter 7 and Appendix 5). Data adjudication focused on residual issues that remained after these quality assurance processes. There were not many such issues, and their projected impact on the validity of the PISA results was deemed to be negligible.

Unlike sampling issues, which under most circumstances could directly affect all of a country's data, the residual issues identified in other areas have an impact on only a small proportion of the data. For example, marking leniency or severity for a single item in reading has an effect on between just one-third and one-half of 1 per cent of the reading data, and even for that small fraction the effect would be minor.
Detailed Country Comments
Summary
Concerns with sampling outcomes and compliance problems with PISA standards resulted in recommendations to place constraints on the use of the data for just two countries: the Netherlands and Japan. The Netherlands' response rate was very low. It was therefore recommended to exclude the data from the Netherlands from tables and charts in the international reports that focus on the level or distribution of student performance in one or more PISA domains, or on the distribution of a characteristic of schools or students. Japan sampled intact classes of students. It was recommended, therefore, that data from Japan not be included in analyses that require disentangling variance components.
Two countries, Brazil and Mexico, had noteworthy coverage issues. In Brazil, it was deemed unreasonable to assess 15-year-olds in very low grades, which resulted in about 69 per cent coverage of the 15-year-old population. In Mexico, just 52 per cent of 15-year-olds are enrolled in schools, much lower than in all other participating countries.
In two countries, Luxembourg and Poland, higher than desirable exclusions were noted. Luxembourg had a school-level exclusion rate of 9.1 per cent. The majority of these excluded students (6.5 per cent) were instructed in languages other than the PISA test languages. In Poland, primary schools accounting for approximately 7 per cent of the population were excluded.
For all other participants, the problems identified and discussed below were judged to be sufficiently small that their impact on comparability was negligible.
Australia
A minor problem occurred in Australia in that six (of 228) schools would not permit a sample to be selected from among all PISA-eligible students, but restricted it in some way, generally by limiting it to those students within the modal grade. Weighting adjustments were made to partially compensate for this. The problem was therefore not regarded as significant and Australia fully met the PISA standards, so inclusion in the full range of PISA reports was recommended.
Austria
In Austria, students in vocational schools are enrolled on a part-time/part-year basis. Since the PISA assessment was conducted at only one point in time, these students are captured only partially. Thus, it is not possible to assess how well the students sampled from vocational schools represent the universe of students enrolled in vocational schools, and so those students not attending classes at the time of the PISA assessment are not represented in the PISA results.

Additionally, Austria did not follow the correct procedure for sampling small schools. This led to a need to trim the weights. Consequently, small schools may be slightly under-represented. It was concluded that any effect of this under-representation would be very slight, and the data were therefore suitable for inclusion in the full range of PISA reports.
Belgium
The weighted school participation rate for the Flemish Community of Belgium was quite low: 61.5 per cent before replacement and 79.8 per cent after replacement. However, the French Community of Belgium was very close to the international standard before replacement (79.9 per cent) and easily reached the standard after replacement (93.7 per cent). For Belgium as a whole, the weighted school participation rate was 69.1 per cent before replacement and 85.5 per cent after replacement. Belgium did not, therefore, quite meet the school response rate requirements.
The German-speaking community, which accounts for about 1 per cent of the PISA population, did not participate.2 As the school response rate was below the PISA 2000 standard, additional evidence was sought from the Flemish Community of Belgium with regard to the representativeness of its sample. The Flemish Community provided additional data which included a full list of schools (the sampling frame) containing the following information on each school:
2 In fact, a sample of students from the German-speaking community was tested, but the data were not submitted to the international Consortium.
• whether it had been sampled;• whether it was a replacement;• its stratum membership;• participation status; and• percentage of over-aged students.
In Belgium, compulsory education starts for all children in the calendar year of the child's sixth birthday. Thus, the student's expected grade can be computed from knowing the birth date.
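This computation can be made explicit; a minimal sketch, where the assumption that the assessment falls in the spring of the school year is ours:

```python
def expected_grade(birth_year, assessment_year=2000):
    """Expected grade, absent retention, for a Belgian student.

    A child enters grade 1 in September of the calendar year of the
    sixth birthday; an assessment held in the spring of `assessment_year`
    falls in the school year that started the previous September
    (spring administration is an assumption here).
    """
    entry_year = birth_year + 6              # grade 1 starts this September
    school_year_start = assessment_year - 1  # spring-assessment assumption
    return school_year_start - entry_year + 1

# A 15-year-old tested in spring 2000 was born in 1984 and, with no grade
# retention, should be in grade 10 -- the modal grade in Table 70; a
# student found in grade 9 is one year over-aged.
```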
The Belgian educational systems use grade retention to increase the homogeneity of student abilities within each class. In most cases, a student who does not reach the expected performance level will not graduate to the next grade. Therefore, a comparison of the expected and actual grade provides an indication of student performance.
Table 70 gives the performance on the combined reading literacy scale for the three most frequent grades in the Flemish student sample and the percentage of 15-year-olds attending them.
Table 70: Performance in Reading by Grade in the Flemish Community of Belgium

Grade    Percentage of Students    Reading Performance
8        2.5                       363.5
9        22.8                      454.7
10       72.2                      564.3
The unweighted correlation between school mean performance on the combined reading literacy scale and the percentage of over-aged students is equal to -0.89. The school percentage of over-aged students constitutes a powerful surrogate for the school mean performance, and was therefore used to see if student achievement in responding schools was likely to be systematically different from that in non-responding schools.
Table 71 shows the mean of the school over-aged percentages according to school participation status and sample status (original or replacement). The data used to compute these results are based only on regular schools. Data from special education schools and part-time education schools were not included, because all originally sampled students from these strata participated in PISA.
The data included in Table 71 show that schools with lower percentages of over-aged students were more likely to participate than schools with higher percentages of over-aged students. It also appears that the addition of replacement schools reduced the bias introduced by the original schools that refused to participate.
Further, additional analyses showed that the bias was correlated with the stratification variables, that is, the type of school (academic, technical or vocational) and the school board. The school-level non-response adjustment applied during weighting would thus have reduced the bias.
Given the marginal amount by which Belgium did not meet the sampling criteria, and the success of replacements and non-response adjustments in correcting for potential bias, it was concluded that there was no cause for concern about the potential of significant non-response bias in the PISA sample for Belgium.
It was also noted that the Flemish Community of Belgium did not adhere strictly to the required six-week testing window. All testing sessions were conducted within a three-month testing period, and therefore none violated the definition of the target population. Variance analyses were performed according to the testing period, and no significant differences were found.
It was concluded that, while Belgium deviated slightly from the PISA standards, it could be recommended that the data be used in the full range of PISA analyses and reports.
Brazil
In Brazil, 15-year-olds are enrolled in a wide range of grades. It was not deemed appropriate
to administer the PISA test to students in grades 1 to 4. This decision was made in consultation with Brazil and the Consortium, and resulted in about 69 per cent coverage of the 15-year-old population.
For many secondary schools in Brazil, an additional sampling stage was created between the school and student selection. Many secondary schools operate in three shifts: morning, afternoon and evening. An approved sampling procedure was developed that permitted the selection of one shift (or two if there was a small shift) at random, with 20 students being selected from the selected shifts.
Weighting factors were applied to the student data to reflect this stage of sampling. Thus, if one shift was selected in a school, the student data were assigned a factor of three in addition to the school and student base weight components. After weighting, the weighted total student population was considerably in excess of the PISA population size as measured by official enrolment statistics. It was concluded that, in many cases, either or both of the following occurred: the shift sampling was not carried out according to the procedures, or the data on student numbers within each school reflected the whole school and not just the selected shifts. Since which errors occurred in which schools could not be determined, these problems could not be addressed in weighting. However, there was no evidence of a systematic bias in the sampling procedures.
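The intended weight computation for this extra stage can be sketched as follows (function and argument names are ours, for illustration only):

```python
def student_final_weight(school_base_weight, student_base_weight,
                         n_shifts, n_shifts_selected):
    """Student weight including Brazil's extra shift-sampling stage.

    Selecting n_shifts_selected of a school's n_shifts at random
    multiplies the weight by n_shifts / n_shifts_selected; one shift
    selected out of three gives the factor of three mentioned above.
    """
    shift_factor = n_shifts / n_shifts_selected
    return school_base_weight * shift_factor * student_base_weight

# One of three shifts selected: the shift stage contributes a factor of 3.
```

Summing such weights over all sampled students should approximately reproduce the enrolled population; the excess observed in Brazil is what signalled the two possible errors described above.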
Canada
Canada fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Czech Republic
The Czech Republic fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Denmark
Denmark fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Table 71: Average of the School Percentage of Over-Aged Students

School Sample and Participation Status              Number of Schools   Mean Percentage
Original sample                                                   145             0.288
All participating schools                                         119             0.262
Participating schools from the original sample                     88             0.255
Non-participating schools from the original sample                 57             0.340
Finland
Finland fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
France
France did not supply data on the actual number of students listed for sampling within each school, as required under PISA procedures for calculating student weights. Rather, France provided the current-year enrolment figures for 15-year-olds as recorded at the Ministry. The data showed that the two quantities were sometimes slightly different, but no major discrepancies were detected. Thus, this problem was not regarded as significant.
Results from the inter-country reliability study indicated an unexpectedly high degree of variation in national ratings of open-ended items. The study showed that agreement between national markers and international verifiers was 80.2 per cent, the lowest level of agreement among all PISA participants. Of the remaining 19.8 per cent, 12.5 per cent of the ratings were found to be inconsistent, 2.3 per cent too harsh and 5.0 per cent too lenient.
Given that this rater inconsistency was likely to affect only a small proportion of items, and because the disagreement included both harshness and leniency deviations, it was concluded that no systematic bias would exist in scale scores and that the data could be recommended for inclusion in the full range of PISA reports.
Germany
Germany fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Greece
The sampling frame for Greece had deficient data on the number of 15-year-olds in each school. This resulted in the selection of many sample schools that had no 15-year-olds enrolled or that were closed (34 of a sample of 175 schools). Often the replacement school was then included, which was not the correct PISA procedure for such cases. Also, quite a few sampled schools (a further 34) had relatively few 15-year-olds.
Rather than being assessed or excluded, these schools were also replaced, often by larger schools. This resulted in some degree of over-coverage in the PISA sample, as there were 18 replacement schools in the sample for schools that should not have been replaced, and 32 replacements for small schools where the replacement was generally larger than the school it replaced. It is not possible to evaluate what kinds of students might be over-represented, but since the measures of size were unreliable, this is unlikely to result in any systematic bias. Thus, this problem was not regarded as significant, and inclusion in the full range of PISA reports was recommended.
Hungary
Hungary did not strictly follow the correct procedure for the sampling of small schools. However, this had very little negative impact on sample quality, and thus inclusion in the full range of PISA reports was recommended.
Iceland
Iceland fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Ireland
Ireland fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Italy
Italy fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Japan
The translation of the assessment material in Japan was somewhat deficient. A rather high number of flawed items was observed in both the field trial and main study. However, careful examination of item-by-country interactions and discussion with the Japanese centre resulted in the omission of only one item (see Figure 25), where there was a problematic interaction. No remaining potential problem led to detectable
item-by-country interactions.
Japan did not follow the PISA 2000 design,
which required sampling students rather than classes in each school. This means that the between-class and between-school variance components will be confounded. The Japanese sample is suitable for PISA analyses that do not involve disentangling school and class variance components, but it cannot be used in a range of multilevel analyses.
It was recommended, therefore, not to include data from Japan in analyses that require disentangling variance components, but the data can be included in all other analyses and reports.
Korea
Korea used non-standard translation procedures, which resulted in a few anomalies. When reviewing items for inclusion in or exclusion from the scaling of PISA data (based upon psychometric criteria), the items were carefully reviewed for potential aberrant behaviour because of translation errors. Four reading items were deleted because of potential problems, and the remaining data were recommended for inclusion in the full range of PISA reports.
Latvia
Latvia did not follow the PISA 2000 translation guidelines, and the international coding study showed many inconsistencies between coding in Latvia and international coding. In the inter-country reliability study, the agreement between national markers and international verifiers was 80.6 per cent. Of the remaining 19.4 per cent, 2.5 per cent of the ratings were found to be too harsh and 9.4 per cent too lenient. These results may be due to the fact that the marking guide was not translated into Latvian, on the assumption that all markers were sufficiently proficient in English to use its English version.
Since this rater inconsistency was likely to affect only a small proportion of items, and because the disagreements included both harshness and leniency deviations, it was concluded that the data could be recommended for inclusion in the full range of PISA reports.
Liechtenstein
Liechtenstein fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Luxembourg
The school-level exclusion rate of 9.1 per cent in Luxembourg was somewhat higher than the PISA requirements. The majority of excluded students (6.5 per cent) were instructed in languages other than the PISA test languages (French and German) and attended the European School. As this deviation was judged to have no marked effect on the results, it was concluded that Luxembourg could be included in the full range of PISA reports.
Mexico
Mexico fully met the PISA standards, and inclusion in the full range of PISA reports was recommended. However, the proportion of 15-year-olds enrolled in schools in Mexico was much lower than in all other participating countries (52 per cent).
Netherlands
The international design and methodology were well implemented in the Netherlands. Unfortunately, school participation rates were not acceptable. Though the initial rate of 27.2 per cent was far below the requirement to even consider replacement (65 per cent), the Netherlands was allowed to try to reach an acceptable response rate through replacements. However, the percentage after replacement (55.6 per cent) was still considerably below the standard required.
The Consortium undertook a range of supplementary analyses jointly with the NPM in the Netherlands, which confirmed that the data might be sufficiently reliable to be used in some relational analyses. The very low response rate, however, implied that the results from the PISA data for the Netherlands cannot, with any confidence, be expected to reflect the national population to the level of accuracy and precision required. This concern extends particularly to distribution and sub-group information about achievement.
The upper secondary educational system in the Netherlands consists of five distinct tracks. VWO is the most academic track; on a continuum through general to vocational education, the others are HAVO, MAVO, and (I)VBO. Many schools offer more than one track, but few offer all four. Thus, for example, of the 95 schools that participated in PISA which could be matched to
the national database of examination results, 38 offered (I)VBO, 76 offered MAVO, 64 offered HAVO and 66 offered VWO. Each part of the analysis therefore applied to only a sub-set of the schools.
Depending upon the track, students take these examinations at the end of grade 10, 11, or 12. The PISA population was mostly enrolled in grades 9 and 10 in 2000 (and thus is neither the same cohort, nor, generally, at the same educational level as the population for which national examination results are available). The school populations for these two groups of students, however, are the same.
To examine the characteristics of the PISA 2000 sample in the Netherlands (at the school level), the NPM was able to access the results of national examinations in 1999. Data from national examinations were combined for the two vocational tracks (IVBO and VBO), but other than this combination, available data were not comparable across tracks, which made it difficult to develop overall summaries of the quality of the PISA school sample. The analyses of data showed two types of outcomes. The first dealt with the percentage of students taking at least three subjects at C or D level³ in the (I)VBO track. This analysis provided some evidence that, when compared to the original sample, the participating schools had a lower percentage of such students than the non-participants and the rest of the population. However, there were only 18 participants of 69 in the sample, and so the discrepancy was not statistically significant. There is evidence that the inclusion of the replacement schools (20 participated) effectively removed any such imbalance. Since this track covers only about 40 per cent of the schools in the population and sample, little emphasis was placed on this result.
A second analysis was made of the average grade point of the national examinations for each school and for each track, overall and for several subject groupings. The grade point average across Dutch language, mathematics and science was used. The results are available for each track, but are not comparable across tracks (as evidenced, for example, by the fact that students take the examinations at
different grades in different tracks).
For the (I)VBO track, the average grade point
across all subjects is 6.17 for PISA participants and 6.29 for the rest of the population, a difference of about 0.375 standard deviations, which is statistically significant (p = 0.034). This phenomenon is not reflected in the results for Dutch language, but is also seen in the results for mathematics and science. Here the mean for participants is 5.83, and 6.03 for the rest of the population. This difference is about 0.45 standard deviations, and is statistically significant (p = 0.009). These results suggest that PISA data may somewhat under-estimate the overall proficiency of students in the Netherlands, at least for mathematics and science.
In the VWO track, the population variance of the PISA participants is considerably smaller than for the rest of the population, for all subjects combined, for Dutch language, and for mathematics and science. Thus the PISA participants appeared to be a considerably more homogeneous group of schools than the population as a whole, even though their mean achievement is only trivially different. Specifically, the variance of the rest of the population is 76 per cent higher than that of the PISA participants for all subjects combined, 39 per cent higher for Dutch language, and 93 per cent higher for mathematics and science. This is also seen in the HAVO track for all subjects and for Dutch language, and in the MAVO track for mathematics and science. These results suggest that the Dutch PISA data might well show a somewhat more homogeneous distribution of student achievement than is true for the whole population, thus leading to biased estimates of population quantiles, percentages above various cut-off points, and sub-group comparisons.
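The homogeneity comparison quoted above reduces to a single ratio computation. The variance ratios below are back-calculated from the percentages in the text, not taken from the Dutch data.

```python
# Back-calculation of the VWO-track homogeneity comparison: a variance ratio
# of r between the rest of the population and the PISA participants is
# reported as "100 * (r - 1) per cent higher".

def pct_variance_excess(var_rest, var_pisa):
    return 100.0 * (var_rest / var_pisa - 1.0)

# Ratios implied by the quoted figures (our inference, not source data):
for subjects, ratio in [("all subjects", 1.76),
                        ("Dutch language", 1.39),
                        ("mathematics and science", 1.93)]:
    print(subjects, round(pct_variance_excess(ratio, 1.0), 1))
```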
Considering the extent of sample non-response, it was noted that the results of this analysis did show a high degree of similarity between participating schools and the rest of the population on the one hand, and the non-participants on the other. However, since the response rate is so low, even minor differences between these groups are of concern.
An additional factor to consider in judging the quality of the PISA sample for the Netherlands is the actual sample size of students in the PISA data file. PISA targeted a minimum of 4 500 assessed students for each country. Not all countries quite reached this target, primarily because of low levels of school participation, but the sample size for the Netherlands is only 56 per cent of this target (2 503 students). Bias considerations aside, the sample cannot therefore give results to the level of precision envisaged in specifying the PISA design. This is especially serious for mathematics and science, since, by design, only 55.6 per cent of the PISA sample were assessed in each of these subjects.

3. Difficulty levels in the national examinations.
In summary, it was concluded that the results from the Netherlands PISA sample may not have been grossly distorted by school non-participation. In fact the correspondence in achievement on national examinations between the participants and the population is remarkably high given the high level of non-participation. However, the high level of school non-participation means that even reasonably minor differences between the participants and the intended sample can mean that the non-response bias might significantly affect the results. The findings of the follow-up analyses above, together with the low participation rate and the small final sample size, meant that the Consortium could not have confidence that the results from the PISA data for the Netherlands reflected the national population to the level of accuracy and precision called for in PISA, particularly for distribution and sub-group information about achievement.
It was recommended, therefore, to exclude data from the Netherlands from tables and charts in the international reports that focus on the level or distribution of student performance in one or more of the PISA domains, or on the distribution of a characteristic of schools or students. Data from the Netherlands could be included in the international database, and the Netherlands could be included in tables or charts that focus primarily on relationships between PISA background variables and student performance.
New Zealand
The before-replacement response rate in New Zealand was 77.7 per cent, and the after-replacement response rate was 86.4 per cent, slightly short of the PISA 2000 standards.
Non-response bias in New Zealand was examined by looking at national English assessment results for Year 11 students in all schools, where 87.6 per cent of the PISA population can be found.
A logistic regression approach was also adopted by New Zealand to analyse the existence of a potential bias. The dependent variable was the school participation status, and the independent variables in the model were:
• school decile, an aggregate at the school level of the social and economic backgrounds of the students attending the school;
• rural versus urban schools;
• the school sample frame stratification variables, i.e., very small schools, certainty⁴ or very large schools, and other schools;
• Year 11 student marks in English on the national examination; and
• initially, the distinction between public and private schools, but this variable was removed due to the size of the private school population.
Prior to performing the logistic regression, Year 11 student marks were averaged according to school participation status, and (i) school decile, (ii) public versus private school, (iii) rural versus urban school, and (iv) the school size stratification variable from the school sampling frame.
The unweighted logistic regression analyses are summarised in Table 72. There is no evidence that the low school response rate introduced bias into the school sample for New Zealand.
Table 72: Results for the New Zealand Logistic Regression

Parameter                          Estimate   Chi-Squared   p-value
English marks                        0.0006        0.0229      0.98
Very small school                   -1.3388        1.4861      0.37
Certainty or very large schools      0.5460        0.5351      0.30
Urban                               -0.1452        0.4104      0.72
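The kind of model behind Table 72 can be sketched as follows. The fitting routine and the toy data are ours, not the New Zealand NPM's; a real analysis would use a statistics package that also reports the chi-squared statistics and p-values shown in the table.

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=4000):
    """Minimal gradient-ascent fit of P(participation) = sigmoid(b0 + b.x)."""
    k = len(xs[0])
    b = [0.0] * (k + 1)  # intercept followed by one slope per predictor
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(xs, ys):
            z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += y - p
            for j, xj in enumerate(x):
                grad[j + 1] += (y - p) * xj
        b = [bj + lr * g / len(xs) for bj, g in zip(b, grad)]
    return b

# Toy schools: one predictor (standardised English mark); participation (1/0)
# was chosen so that marks carry almost no information about responding,
# mimicking the near-zero English-marks estimate in Table 72.
marks = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
participated = [1, 0, 1, 1, 0, 1, 1, 0]
b0, b_marks = fit_logistic(marks, participated)
print(round(b_marks, 2))  # expected to be small: marks barely predict response
```

A slope estimate near zero, as for English marks in Table 72, is what "no evidence of non-response bias" means in this setting.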
Based on the logistic regression and the sub-group mean achievement, it is possible that students in the responding schools are slightly lower achievers than those in the non-responding schools. But the difference in the average results for responding and non-responding schools was 0.04 standard deviations, which is not significant. Further, after the effects of the key stratification variables, which were used to make school non-response adjustments, there is no noticeable effect of school mean achievement on PISA participation. Furthermore, there is no evidence that the responding and non-responding schools might differ substantially in their homogeneity (as distinct from the case in the Netherlands).

4. Schools with a selection probability of 1.0.
It was recommended, therefore, that New Zealand data could be included in the full range of PISA reports.
Norway
Norway fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Poland
Poland did not follow the international sampling methodology. First, Poland did not follow the guidelines for sampling 15-year-olds in primary schools. Consequently, the small sample of students from these schools, which included approximately 7 per cent of the 15-year-old population, has been excluded from the data. These students were very likely to have been well below the national average in achievement.
Second, Poland extended the testing period, and 14 schools tested outside the three-month testing window that was required to ensure the accurate estimation of student age at testing time. Analysis showed that schools testing outside the window performed better than schools testing inside the window, and the former were dropped from the national database.
Strong evidence was provided that school non-response bias is minimal in the PISA data for Poland, based on the following:
• The majority of school non-response in Poland was not actual school non-response, but the consequence of schools testing too late.
• Schools testing outside the window performed better than those that tested inside the window, which vindicates the decision to omit them from the data, since their higher performance may have been due to the later testing.
• The results with the late schools omitted are about one scale score point lower than they would be had these late schools been included. This indicates that had these schools tested during the window and been included, the results would have been less than one point above the published result. It seems extremely likely that the result would have fallen between the published result and one point higher.
• The bias from the necessary omission of the late-assessed schools seems certain to be less than one point, which is insignificant considering the sampling errors involved.
• All non-responding schools (i.e., late-testing schools plus true non-respondents) match the assessed sample well in terms of big city/small city/town/rural representation.
The relatively low level of non-response (i.e., Poland did not fall far below the acceptable standard) indicates that no significant bias has resulted from school non-response.
It was concluded that while the exclusion of the primary schools should be noted, data could be recommended for inclusion in the full range of PISA reports, provided the difference between the target population in Poland and the international target population was clearly noted.
Portugal
Portugal fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Russian Federation
The adjudication noted four concerns about the implementation of PISA 2000 in the Russian Federation. First, material was not translated using the standard procedures recommended in PISA, and the number of flawed items was relatively high in the main study. Second, in the inter-country reliability study, the agreement between national markers and international verifiers was 88.5 per cent, which is just below the 90 per cent required. Of the remaining 11.5 per cent, 0.9 per cent of the ratings were found to be too harsh, 7.0 per cent too lenient and 3.6 per cent inconsistent. The Consortium discussed with the Russian Federation national centre the possibility of re-scoring the open-ended items, but concluded that this would have a negligible effect on country results. In making this assessment, the Consortium carefully examined the psychometric properties of the Russian Federation items and did not identify an unusual number of item-by-country interactions. Third, the Student Tracking Forms did not include students excluded within the school, meaning that it was not possible to estimate the total exclusion rate for the Russian Federation. Fourth, the correct procedure for sampling small schools was not followed, which led to a minor requirement to trim the weights, making small schools very slightly under-represented in the weighted data.
It was concluded that while these issues should be noted, the data could be recommended for inclusion in the full range of PISA reports.
Spain
Spain did not follow the correct procedure for sampling small schools. This led to a need to trim the weights; small schools consequently may be slightly under-represented. This problem is not regarded as significant, and inclusion in the full range of PISA reports was recommended.
Sweden
Sweden fully met the PISA standards, and inclusion in the full range of PISA reports was recommended.
Switzerland
Switzerland included a national option to survey grade 9 students and to over-sample them in some parts of the country. Thus, there were explicit strata containing schools with grade 9, and other strata containing schools with 15-year-olds but no grade 9 students.
From the grade 9 schools, a sample of schools was selected in some strata, within which a sample of 50 students was selected from among those in grade 9, plus any other 15-year-olds.
In other strata, where stratum-level estimates for grade 9 students were required from the PISA assessment, a relatively large sample of schools was selected. In a sub-sample of these schools (made via systematic selection), a sample of 50 grade 9 students, plus 15-year-old students, was selected. In the remaining schools, a sample of 35 students was selected from among the grade 9 students only (regardless of their age).
In the strata of schools with no grade 9, a sample of schools was selected, from each of which
a sample of 35 15-year-old students was selected.
Thus, the sample contained three groups of
students: those in grade 9 aged 15, those in grade 9 not aged 15, and those aged 15 not in grade 9. Within some strata, the students in the third group were sampled at a lower rate (considering the school and student-within-school sampling rates combined) than students in grade 9. The PISA weighting procedure ensured that these differential sampling rates were taken into account when analysing the 15-year-old samples, grade 9 and others combined. This procedure meant that unweighted analyses of PISA data in Switzerland would be very likely to lead to erroneous conclusions, as grade 9 students were over-represented in the sample.
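The effect of these differential sampling rates on unweighted statistics can be illustrated with invented numbers. None of the figures below come from the Swiss data; they simply show why analyses of the 15-year-old sample must weight each student by the inverse of the combined selection probability.

```python
# Invented example: the 15-year-old sample split into its two components,
# with grade 9 students sampled at five times the rate of the others.
# Columns: (group, sampled students, selection probability, mean score).
groups = [
    ("grade 9, aged 15",     1000, 0.10, 510.0),
    ("aged 15, not grade 9",  200, 0.02, 470.0),
]

unweighted = (sum(n * m for _, n, _, m in groups)
              / sum(n for _, n, _, _ in groups))
weighted = (sum((n / p) * m for _, n, p, m in groups)   # weight = 1 / p
            / sum(n / p for _, n, p, _ in groups))

print(round(unweighted, 1))  # 503.3: pulled towards the over-sampled grade 9 group
print(round(weighted, 1))    # 490.0: each group restored to its population share
```

In this construction both groups have the same population size (n / p = 10 000), so the weighted mean is the simple average of the two group means, while the unweighted mean is dominated by the over-sampled grade 9 students.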
This entire procedure was fully approved and carried out correctly. However, there was a minor problem with school non-response. In one stratum with about 1 per cent of the PISA population in Switzerland, 35 schools in the population had both grade 9 and other 15-year-old students. All 35 schools were included in the grade 9 sample, and a sub-sample of two of the schools was selected to include other 15-year-old students in the sample as well. The school response rate in this stratum was very low. One of the two schools that were to include all 15-year-olds participated, and only three of the other 33 schools participated. This very low response rate (with a big differential between the two parts of the sample) meant that the standard approach to creating school non-response adjustments would not have worked appropriately. Therefore, schools from this explicit stratum had to be combined with those from another stratum to adjust for school non-response. The effects on the overall results for Switzerland are very minor, as this stratum is so small.
The translation verifier noted that the Swiss-German version of the material underwent substantial national revisions after the field trial (most of which were not documented), while the changes introduced in the source versions and the corrections suggested by the verifier were often overlooked. As a result, the Swiss-German version differed significantly from the other German versions of the material, and had slightly more flawed items.
It was concluded that none of the deviations from the PISA standards would noticeably affect the results, and inclusion in the full range of PISA reports was recommended.
United Kingdom
The adjudication revealed three potential problems with data from the United Kingdom. First, schools in Wales were excluded. Second, the school response rate prior to replacement (61.3 per cent) was below the PISA 2000 standard of 65 per cent; the response rate after replacements was 82.1 per cent. Further analyses of the potential for school non-response bias were therefore conducted. Third, test administration procedures in Scotland did not respect the PISA requirements. Test administrators were neither trained nor given an adequate script. This is considered to be a matter of serious concern, but Scotland has only a small percentage of the United Kingdom's 15-year-olds, and this violation did not seriously affect the United Kingdom's results. In addition, student sampling procedures were not correctly implemented in Scotland. The unapproved procedure that was implemented was dealt with and is very unlikely to have biased the United Kingdom's results, but it did increase the number of absent students in the data.⁵
The potential influence of the exclusion of schools in Wales was considered by comparing the distribution of General Certificate of Secondary Education (GCSE) results for England and Wales. The comparison showed that the distribution of achievement in Wales is essentially the same as in England. Further, since Wales has only 5.1 per cent of the United Kingdom's population, while England has 82.4 per cent, it did not seem credible that the distribution in Wales could be so different as to add a noticeable bias.
The primary source of the problems with school response was in England, where the response rate before replacement was 58.8 per cent. The NPM therefore carried out some regression analyses in order to examine which factors were associated with school non-response (prior to replacement) in England. The possibility of bias in the initial sample of schools, before replacement of
refusing schools, was examined by evaluating independent variables as predictors of school non-response. These included educational outcomes of the schools at GCSE level (public examinations taken at age 16); and socio-economic status as measured by the percentage of pupils eligible for free school meals, a widely accepted measure of poverty in education studies in England.
In addition, sample stratification variables were included in the models. Since these were used to make school non-response adjustments, the relevant question is whether GCSE level and the percentage of students eligible for free meals at school are associated with school response when they are also included in the model.
The conclusions were that, in view of the small coefficients for the response status variable and their lack of statistical significance (p = 0.426 and 0.150, respectively), there was no substantial evidence of non-response bias in terms of the GCSE average point score (in particular), or of the percentage of students eligible for free school meals. Participating schools were estimated to have a mean GCSE level only 0.734 units higher than those that did not, and a percentage of students eligible for free school meals that was only 0.298 per cent lower.
A further potential problem, in addition to the three discussed above, was that 11 schools (of 349) restricted the sample of PISA-eligible students in some way, generally by limiting it to those students in the lower of the two grades with PISA-eligible students. However, weighting adjustments were able to partially compensate for this.
It was concluded that while these issues should be noted, the data could be recommended for inclusion in the full range of PISA reports.
United States
The before-replacement response rate in the United States was 56.4 per cent, and the after-replacement response rate was 70.3 per cent, which fell short of the PISA 2000 standards. The NPM explored this matter extensively with the Consortium and provided data as evidence that the achieved sample was representative.
It was not possible to check for systematic differences in the assessment results, as the United States had no assessment data from the non-responding schools, but it was possible to check for systematic differences in the school characteristics data available for both the responding and non-responding schools. It is well known from past surveys in the United States that characteristics such as region, urban status, percentage of minorities, public/private status, etc., are correlated with achievement, so that systematic differences in them are likely to indicate the presence of potential bias in assessment estimates.

5. Scotland sampled 35 students from each school. The first 30 students were supposed to be tested and the last five students were considered as replacement students if one or more students could not attend the testing session. The Consortium dealt with this deviation by regarding all sampled students who were not tested as absentees/refusals.
The potential predictors for school non-participation evaluated in the additional analysis provided by the United States are:
• NAEP region (Northeast, Midwest, South, West);
• urban status (urban or non-urban area);
• public/private status;
• school type (public, Catholic, non-Catholic religious, non-religious private, Department of Defense);
• minority percentage category (less than 4 per cent, 4 to 18 per cent, 18 to 34 per cent, 34 to 69 per cent, higher than 69 per cent);
• percentage eligible for school lunch (less than 13.85 per cent, 13.85 to 37.1 per cent, higher than 37.1 per cent, missing);
• estimated number of 15-year-olds; and
• school grade-span (junior high: low grade, 7th and high grade, 9th; high school: low grade, 9th; middle school: low grade, 4th and high grade, 8th; combined school: all others).
Each characteristic was chosen because it was available for both participating and non-participating schools, and because in past surveys it was found to be linked to response propensity.
The analysis began with a series of logistic regression models that included NAEP region, urban status, and each of the remaining predictors individually in turn. Region and urban status were included in each model because they have historically been correlated with school participation and are generally used to define cells for non-response adjustments. The remaining predictors were then evaluated with region and urban status already included in the model. A model with all predictors was also fitted.
In most models, the South and Northeast region parameters were significant at the 0.05 level. Urban status was significant at the 0.10 level, except in the model containing all predictors.
The other important predictors were the percentage of minority students and the percentage of students eligible for free lunches. Schools with 34 to 69 per cent minority students were significantly more likely to refuse to participate than schools with more than 69 per cent minority students (p = 0.0072). The 18 to 34 per cent minority student category also had a relatively small p-value: 0.0857. Two of the categories of percentage of students eligible for free lunches had p-values around 0.10. In the final model that included all predictors, the percentage of minority students and the percentage of students eligible for free lunches proved to be the most important predictors. These results are confirmed in the model selection analysis presented below.
After fitting a logistic regression model with all potential predictors, three types of model selection were performed to select significant predictors: backward, forward, and stepwise.6

For this analysis, all three model-selection procedures chose the same four predictors: NAEP region, urban status, percentage of minority students, and percentage of students eligible for free lunches. The parameter estimates, chi-squared statistics and p-values for this final model are shown in Table 73.

The non-response analyses did find differences in the distribution between respondents and non-respondents on several of the school characteristics.
6 The dependent variable again was the school's participation status. These techniques all select a set of predictor variables from a larger set of candidates. In backward selection, a model is fitted using all the candidate variables. Significance levels are then checked for all predictors. If any level is greater than the cut-off value (0.05 for this analysis), one predictor is dropped from the model: the one with the greatest p-value (i.e., the least significant predictor). The procedure then iteratively checks the remaining set of predictors, dropping the least powerful predictor if any are above the cut-off value. NAEP region and urban status were not allowed to be dropped. The procedure stops when all remaining predictors have p-values smaller than the cut-off value. Forward selection works in a similar fashion, but starts with only the intercept term and any effects forced into the model. It then adds to the model the most significant of the remaining predictors, as computed by the score chi-squared statistic. The process continues until no effect meets the required level of significance. An effect is never removed once it has entered the model. Stepwise selection begins like forward selection, but the effects do not necessarily remain in the model once they are entered. A forward selection step may be followed by a backward elimination step if an effect is no longer significant in the company of a newly-entered effect.
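The backward-elimination loop described in note 6 can be sketched as follows. The fitting routine, data and column names are invented for illustration; as in the analysis described, region and urban status are forced to stay in the model and the cut-off is 0.05.

```python
import math

import numpy as np

def wald_pvalues(X, y, n_iter=30):
    """Newton-Raphson logistic fit; returns one Wald p-value per column of X."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        H = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in beta / se])

def backward_select(columns, data, y, forced, alpha=0.05):
    """Backward elimination: refit, then drop the least significant
    predictor whose p-value exceeds alpha; `forced` columns are
    never dropped. Stops when all remaining p-values are below alpha."""
    keep = list(columns)
    while True:
        X = np.column_stack([data[c] for c in keep])
        pvals = dict(zip(keep, wald_pvalues(X, y)))
        droppable = {c: p for c, p in pvals.items()
                     if c not in forced and p > alpha}
        if not droppable:
            return keep
        keep.remove(max(droppable, key=droppable.get))  # least significant

# Invented data: one real effect ('minority') and two pure-noise columns.
rng = np.random.default_rng(1)
n = 3000
data = {
    "const": np.ones(n),
    "region": rng.integers(0, 2, n).astype(float),
    "urban": rng.integers(0, 2, n).astype(float),
    "minority": rng.integers(0, 2, n).astype(float),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
}
logit = -0.4 + 0.3 * data["region"] + 1.2 * data["minority"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

selected = backward_select(list(data), data, y,
                           forced={"const", "region", "urban"})
```

With a strong simulated effect, the `minority` column survives elimination while the forced columns are retained by construction.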
As expected, region and urban status were found to be predictors of response status. In addition, the percentage of minority students and percentage of students eligible for free lunches were also important predictors. Since they are often correlated with achievement, assuming that school non-response is a random occurrence could lead to bias in assessment estimates.

The school non-response adjustment has incorporated these two additional predictors along with region and urban status in defining the adjustment cells. Defining the cells in this way distributes the weight of non-participating schools to 'similar' (in terms of characteristics known for all schools) participating schools, which reduces the potential bias due to lack of assessment in non-participating schools. However, there is no way to completely eliminate all potential non-response bias because assessment information is not available from schools that do not participate.
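As a sketch of how such cell-based adjustments work (the cell labels and weights here are invented): within each adjustment cell, the base weights of participating schools are scaled up so that they carry the weight of all eligible schools in the cell.

```python
from collections import defaultdict

def nonresponse_factors(schools):
    """Within each adjustment cell, the factor by which participating
    schools' base weights are scaled so that they carry the weight of
    all eligible schools in that cell."""
    total = defaultdict(float)   # base weight of all eligible schools
    resp = defaultdict(float)    # base weight of participants only
    for s in schools:
        total[s["cell"]] += s["base_weight"]
        if s["participated"]:
            resp[s["cell"]] += s["base_weight"]
    return {c: total[c] / resp[c] for c in total if resp[c] > 0}

# Invented cell label: region x urban status x minority x free lunch
schools = [
    {"cell": "South/urban/hi-min/hi-lunch", "base_weight": 10.0, "participated": True},
    {"cell": "South/urban/hi-min/hi-lunch", "base_weight": 12.0, "participated": False},
    {"cell": "South/urban/hi-min/hi-lunch", "base_weight": 8.0, "participated": True},
]
factors = nonresponse_factors(schools)
# The two participants now carry (10 + 8) * factor = 30, the cell total.
```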
There is also a strong, but highly non-linear, relationship with minority (black and Hispanic) enrolment. Schools with relatively high and relatively low minority enrolments were considerably more likely to participate than those with intermediate levels of minority enrolment. The implications for the direction of the bias are not clear. Other characteristics examined do not show a discernible relationship.
Many of the 145 participating schools restricted the sample of PISA-eligible students in some way, generally by limiting it to students within the modal grade. Weighting adjustments were made to partially compensate for this. Schools that surveyed the modal grade (10) were treated as respondents in calculating the weighted response rates, while those that did not permit testing of grade 10 students were handled as non-respondents for calculating school response rates, but their student data were included in the database.

It was concluded that while these issues should be noted, the data could be recommended for inclusion in the full range of PISA reports.
Table 73: Results for the United States Logistic Regression
Parameter Estimate Chi-Squared p-value
Northeast region 0.4890 2.7623 0.0965
Midwest region 0.3213 1.1957 0.2742
South region -0.7919 4.7966 0.0285
(West region) - - -
Urban area -0.3644 2.6241 0.1053
(Non-urban area) - - -
Less than 4% minority -0.6871 2.7699 0.0960
4-18% minority -0.2607 0.5384 0.4631
18-34% minority 0.6348 3.8064 0.0511
34-69% minority 1.3001 13.5852 0.0002
(Greater than 69% minority) - - -
Missing free lunch eligibility 0.5443 3.1671 0.0751
≤ 13.85% free lunch eligibility 0.6531 3.6340 0.0566
13.85-37.1% free lunch eligibility -1.0559 10.329 0.0013
Proficiency Scales Construction
Ross Turner
Introduction
PISA seeks to report outcomes in terms of proficiency scales that are based on scientific theory and that are interpretable in policy terms. There are two further considerations for the development of the scales and levels:
• PISA must provide a single score for each country for each of the three domains. It is also recognised that multiple scales might be useful for certain purposes, and the development of these has been considered alongside the need for a single scale.
• The proficiency descriptions must shed light on trends over time. The amount of data available to support the detailed development of proficiency descriptions varies, depending on whether a particular domain is a 'major' or 'minor' domain for any particular survey cycle. Decisions about scale development need to recognise this variation, and must facilitate the description of any changes in proficiency levels achieved by countries from one survey cycle to the next.

Development of a method for describing proficiency in PISA reading, mathematical and scientific literacy took place over more than a year, in order to prepare for the reporting of outcomes of the PISA 2000 surveys. The three Functional Expert Groups (FEGs) (for reading, mathematics and science) worked with the Consortium to develop sets of described proficiency scales for the three PISA 2000 test domains. Consultations with the Board of Participating Countries (BPC), National Project Managers (NPMs) and the PISA Technical Advisory Group (TAG) took place over several stages. The sets of described scales were presented to the BPC for approval in April 2001. This chapter presents the outcomes of this process.
Development of the Described Scales
The development of described proficiency scales for PISA was carried out through a process involving a number of stages. The stages are described here in a linear fashion, but in reality the development process involved some backwards and forwards movement where stages were revisited and descriptions were progressively refined. Essentially the same development process was used in each of the three domains.
Stage 1: Identifying Possible Sub-scales
The first stage in the process involved the experts in each domain articulating possible reporting scales (dimensions) for the domain.

For reading, two main options were actively considered. These were scales based on two possible groupings of the five 'aspects' of reading (retrieving information; forming a broad understanding; developing an interpretation; reflecting on content; and reflecting on the form of a text).1
In the case of mathematics, a single proficiency scale was proposed, though the possibility of reporting according to the 'big ideas' or the 'competency classes' described in the PISA mathematics framework was also considered. This option is likely to be further explored in PISA 2003 when mathematics is the major domain.

1 While strictly speaking the scales based on aspects of reading are sub-scales of the combined reading literacy scale, for simplicity they are mostly referred to as 'scales' rather than 'sub-scales' in this report.
For science, a single overall proficiency scale was proposed by the FEG. There was interest in considering two sub-scales, for 'scientific knowledge' and 'scientific processes', but the small number of items in PISA 2000, when science was a minor domain, meant that this was not possible. Instead, these aspects were built into descriptions of a single scale.

Wherever multiple scales were under consideration, they arose clearly from the framework for the domain, they were seen to be meaningful and potentially useful for feedback and reporting purposes, and they needed to be defensible with respect to their measurement properties.

Because of the longitudinal nature of the PISA project, the decision about the number and nature of reporting scales had to take into account the fact that in some test cycles a domain will be treated as 'minor' and in other cycles as 'major'. The amount of data available to support the development of described proficiency scales will vary from cycle to cycle for each domain, but the BPC expects proficiency scales that can be compared across cycles.
Stage 2: Assigning Items to Scales
The second stage in the process was to review the association of each item in the main study with each of the scales under consideration. This was done with the involvement of the subject-matter FEGs, the test developers and Consortium staff, and other selected experts (for example, some NPMs were involved). The statistical analysis of item scores from the field trial was also useful in identifying the degree to which items allocated to each sub-scale fitted within that scale, and in validating the work of the domain experts.
Stage 3: Skills Audit
The next stage involved a detailed expert analysis of each item, and in the case of items with partial credit, of each score step within the item, in relation to the definition of the relevant sub-scale from the domain framework. The skills and knowledge required to achieve each score step were identified and described.

This stage involved negotiation and discussion among the interested players, circulation of draft material, and progressive refinement of drafts on the basis of expert input and feedback.
Stage 4: Analysing Field Trial Data
For each set of scales being considered, the field trial data for those items that were subsequently selected for the main study were analysed using item response techniques to derive difficulty estimates for each achievement threshold for each item in each sub-scale.

Many items had a single achievement threshold (associated with getting the item right rather than wrong). Where partial credit was available, more than one achievement threshold could be calculated (achieving a score of one or more rather than zero, two or more rather than one, etc.).

Within each sub-scale, achievement thresholds were placed along a difficulty continuum linked directly to student abilities. This analysis gives an indication of the utility of each scale from a measurement perspective (for examples, see Figures 22, 23 and 24).
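As an illustration of how such thresholds behave (the step parameters below are invented, and this toy code stands in for the operational scaling software): under a partial credit model, the threshold for achieving score k or more can be found by bisecting for the ability at which P(score >= k) = 0.5.

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities for a partial credit item: category k has
    log-numerator sum_{j<=k}(theta - delta_j), with category 0 at zero."""
    lognum = [0.0]
    for d in deltas:
        lognum.append(lognum[-1] + theta - d)
    mx = max(lognum)
    ex = [math.exp(v - mx) for v in lognum]
    s = sum(ex)
    return [v / s for v in ex]

def threshold(deltas, k, lo=-10.0, hi=10.0, tol=1e-10):
    """Ability at which P(score >= k) = 0.5, found by bisection;
    this cumulative probability increases monotonically with theta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(pcm_probs(mid, deltas)[k:]) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

deltas = [-0.6, 0.9]       # invented step difficulties for a 2-step item
t1 = threshold(deltas, 1)  # threshold for scoring 1 or more, not 0
t2 = threshold(deltas, 2)  # threshold for scoring 2, not 1 or fewer
```

For a dichotomous item (a single step), the threshold coincides with the item difficulty; for partial-credit items, these cumulative thresholds are always ordered along the continuum.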
Stage 5: Defining the Dimensions
The information from the domain-specific expert analysis (Stage 3) and the statistical analysis (Stage 4) was combined. For each set of scales being considered, the item score steps were ordered according to the size of their associated thresholds and then linked with the descriptions of associated knowledge and skills, giving a hierarchy of knowledge and skills that defined the dimension. Natural clusters of skills were found using this approach, which provided a basis for understanding each dimension and describing proficiency levels.
Stage 6: Revising and Refining Main Study Data
When the main study data became available, the information arising from the statistical analysis about the relative difficulty of item thresholds was updated. This enabled a review and revision
of Stage 5 by the working groups, FEGs and other interested parties. The preliminary descriptions and levels were then reviewed and revised in the light of further technical information that was provided by the Technical Advisory Group, and an approach to defining levels and associating students with those levels that had arisen from discussion between the Consortium and the OECD Secretariat was implemented.
Stage 7: Validating
Two major approaches to validation were then considered and used to varying degrees by the three working groups. One method was to provide knowledgeable experts (e.g., teachers, or members of the subject matter expert groups) with material that enabled them to judge PISA items against the described levels, or against a set of indicators that underpinned the described levels. Some use of such a process was made, and further validation exercises of this kind may be used in the future. Second, the described scales were subjected to an extensive consultation process involving all PISA countries through their NPMs. This approach to validation rests on the extent to which users of the described scales find them informative.
Defining Proficiency Levels
Developing described proficiency levels progressed in two broad phases. The first, which came after the development of the described scales, was based on a substantive analysis of PISA items in relation to the aspects of literacy that underpinned each test domain. This produced proficiency levels that reflected observations of student performance and a detailed analysis of the cognitive demands of PISA items. The second phase involved decisions about where to set cut-off points for levels and how to associate students with each level. This is both a technical and very practical matter of interpreting what it means to 'be at a level', and has very significant consequences for reporting national and international results.
Several principles were considered for developing and establishing a useful meaning for 'being at a level', and therefore for determining an approach to locating cut-off points between levels and associating students with them:
• A 'common understanding' of the meaning of levels should be developed and promoted. First, it is important to understand that the literacy skills measured in PISA must be considered as continua: there are no natural breaking points to mark borderlines between stages along these continua. Dividing each of these continua into levels, though useful for communication about students' development, is essentially arbitrary. Like the definition of units on, for example, a scale of length, there is no fundamental difference between 1 metre and 1.5 metres: it is a matter of degree. It is useful, however, to define stages, or levels along the continua, because they enable us to communicate about the proficiency of students in terms other than numbers. The approach adopted for PISA 2000 was that it would only be useful to regard students as having attained a particular level if this would mean that we can have certain expectations about what these students are capable of in general when they are said to be at that level. It was decided that this expectation would have to mean at a minimum that students at a particular level would be more likely to solve tasks at that level than to fail them. By implication, it must be expected that they would get at least half of the items correct on a test composed of items uniformly spread across that level, which is useful in helping to interpret the proficiency of students at different points across the proficiency range defined at each level.
• For example, students at the bottom of a level would complete at least 50 per cent of tasks correctly on a test set at the level, while students at the middle and top of each level would be expected to achieve a much higher success rate. At the top end of the bandwidth of a level would be the students who are 'masters' of that level. These students would be likely to solve about 80 per cent of the tasks at that level. But, being at the top border of that level, they would also be at the bottom border of the next level up, where according to the reasoning here they should have a likelihood of at least 50 per cent of solving any tasks defined to be at that higher level.
• Further, the meaning of being at a level for a given scale should be more or less consistent for each level. In other words, to the extent possible within the substantively based definition and description of levels, cut-off points should create levels of more or less constant breadth. Some small variation may be appropriate, but in order for interpretation and definition of cut-off points and levels to be consistent, the levels have to be about equally broad. Clearly this would not apply to the highest and lowest proficiency levels, which are unbounded.
• A more or less consistent approach should be taken to defining levels for the different scales. Their breadth may not be exactly the same for each proficiency scale, but the same kind of interpretation should be possible for each scale that is developed.

A way of implementing these principles was developed. This method links the two variables mentioned in the preceding paragraphs, and a third related variable. The three variables can be expressed as follows:
• the expected success of a student at a particular level on a test containing items at that level (proposed to be set at a minimum that is near 50 per cent for the student at the bottom of the level, and higher for other students in the level);
• the width of the levels in that scale (determined largely by substantive considerations of the cognitive demands of items at the level and observations of student performance on the items); and
• the probability that a student in the middle of a level would correctly answer an item of average difficulty for that level (in fact, the probability that a student at any particular level would get an item at the same level correct), sometimes referred to as the 'RP-value' for the scale (where 'RP' indicates 'response probability').

Figure 31 summarises the relationship among these three mathematically linked variables. It shows a vertical line representing a part of the scale being defined, one of the bounded levels on the scale, a student at both the top and the bottom of the level, and reference to an item at the top and an item at the bottom of the level. Dotted lines connecting the students and items are labelled P=? to indicate the probability associated with that student correctly responding to that item.

PISA 2000 implemented the following solution: start with the substantively determined range of abilities for each bounded level in each scale (the desired band breadth); then determine the highest possible RP value that will be common across domains, which would give effect to the broad interpretation of the meaning of 'being at a level' (an expectation of correctly responding to a minimum of 50 per cent of the items in a test at that level).
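The arithmetic linking these three variables can be illustrated numerically under the Rasch model. The values below (an RP of 0.62 and a level 0.8 logits wide) are illustrative only, not the values adopted operationally; the point is that if items are 'located' at the ability where a student has probability RP of success, a student at the bottom of a level still answers at least half of a test spread uniformly across that level.

```python
import math

def p_correct(theta, delta):
    """Rasch model probability of success on an item of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def expected_success(theta, level_lo, width, rp, n_items=2001):
    """Expected proportion correct for a student at ability `theta` on a
    test whose items' RP-locations are spread evenly across the level
    [level_lo, level_lo + width]. An item 'located' at x under response
    probability rp has raw Rasch difficulty x - log(rp / (1 - rp))."""
    shift = math.log(rp / (1.0 - rp))
    locs = [level_lo + width * i / (n_items - 1) for i in range(n_items)]
    return sum(p_correct(theta, loc - shift) for loc in locs) / n_items

# Illustrative numbers only: one level 0.8 logits wide, RP = 0.62.
bottom = expected_success(0.0, 0.0, 0.8, 0.62)  # student at the bottom border
top = expected_success(0.8, 0.0, 0.8, 0.62)     # student at the top border
# bottom comes out just above 0.5; top is considerably higher.
```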
Figure 31: What it Means to be at a Level
After doing this, the exact average percentage of correct answers on a test composed of items at a level could vary slightly among the different domains, but will always be at least 50 per cent at the bottom of the level.
The highest and lowest described levels are unbounded. Above a certain high point on the scale and below a certain low point, the proficiency descriptions could, arguably, cease to be applicable. At the high end of the scale, this is not such a problem since extremely proficient students could reasonably be assumed to be capable of at least the achievements described for the highest level. At the other end of the scale, however, the same argument does not hold. A lower limit therefore needs to be determined for the lowest described level, below which no meaningful description of proficiency is possible.

As levels 2, 3 and 4 (within a domain) will be equally broad, it was proposed that the floor of the lowest described level be placed at this breadth below the upper boundary of level 1 (that is, the cut-off between levels 1 and 2). Student performance below this level is lower than that which PISA can reliably assess and, more importantly, describe.
Reading Literacy
Scope
The purpose of the PISA reading literacy assessment is to monitor and report on the reading proficiency of 15-year-olds as they approach the end of compulsory schooling. Each task in the assessment has been designed to gather a specific piece of evidence about reading proficiency by simulating a reading activity that a reader might experience in or outside school, as an adolescent, or in adult life.

PISA reading tasks range from very straightforward comprehension activities to quite sophisticated activities requiring deep and multiple levels of understanding. Reading proficiency is characterised by organising the tasks into five levels of increasing proficiency, with Level 1 describing what is required to respond successfully to the most basic tasks, and Level 5 describing what is required to respond successfully to the most demanding tasks.
Three Reading Proficiency Scales
Why aspect scales?

The scales represent three major reading aspects or purposes, which are all essential parts of typical reading in contemporary developed societies: retrieving information from a variety of reading materials; interpreting what is read; and reflecting upon and evaluating what is read.

People often have practical reasons for retrieving information from reading material, and the tasks can range from locating details required by an employer from a job advertisement to finding a telephone number with several prefixes.

Interpreting texts involves processing a text to make internal sense of it. This includes a wide variety of cognitive activities, all of which involve some degree of inference. For example, a task may involve connecting two parts of a text, processing the text to summarise the main ideas, or finding a specific instance of something described earlier in general terms. When interpreting, a reader is identifying the underlying assumptions or implications of part or all of a text.

Reflection and evaluation involves drawing on knowledge, ideas, or attitudes that are external to the text in order to relate the new information that it provides to one's own conceptual and experiential frames of reference.

Dividing reading into different aspects is not the only possible way to divide the task of reading. Text types or formats could be used for organising the domain, as earlier large-scale studies did, including the International Association for the Evaluation of Educational Achievement (IEA) Reading Literacy Study (narrative, exposition and document) and the International Adult Literacy Survey (IALS, prose and document). However, using aspects seems to reflect most closely the policy objectives of PISA, and the hope is that the provision of these scales will offer a different perspective to understanding how reading proficiency develops.
Why three scales rather than five?

The three scales are based on the set of five aspect variables described in the PISA reading literacy framework (OECD, 1999a): retrieving information; forming a broad understanding; developing an interpretation; reflecting on the content of a text; and reflecting on the form of a text. These five aspects were defined primarily to ensure that the PISA framework and the
collection of items developed according to this framework would provide an optimal coverage of the domain of reading literacy as it was defined by the experts. For reporting purposes and for communicating the results, a more parsimonious model was thought to be more appropriate. Collapsing the five-aspect framework on to three aspect scales is justified by the high correlations between the sets of items. Consequently 'developing an interpretation' and 'forming a broad understanding' have been grouped together because information provided in the text is processed by the reader in some way. In the case of 'forming a broad understanding', this occurs in the whole text, and in the case of 'developing an interpretation', this occurs in one part of the text in relation to another. 'Reflecting on the content of a text' and 'reflecting on the form of a text' have been collapsed into a single reflection and evaluation sub-scale because the distinction between reflecting on form and reflecting on content, in practice, was not found to be noticeable in the data as they were collected in PISA.

Even these three scales overlap considerably. In practice, most tasks make many different demands on readers. The three aspects are conceived as interrelated and interdependent, and assigning a task to one or another of the scales is often a matter of fine discrimination about its salient features, and about the approach typically taken to it. Empirical evidence from the main study data was used to validate these judgements. Despite the interdependence of the three scales, they reveal interesting and useful distinctions between countries and among sub-groups within the countries.

Pragmatic and technical reasons also determined the use of only three aspects to describe proficiency scales. In 2003 and 2006, reading will be a minor domain and will therefore be restricted to about 30 items, which would be insufficient for reporting over five scales.
Task Variables
In developing descriptions of the conditions and features of tasks along each of the three scales, distinct, intersecting sets of variables associated with task difficulty seemed to be operating. These variables are based on the judgements by reading experts of the items that fall within each level band rather than on any systematic analysis, although systematic analyses will be made later.

The PISA definition of reading literacy builds, in part, on the IEA Reading Literacy Study (Elley, 1992), but also on IALS. It reflects the emphasis of the latter study on the importance of reading skills in active and critical participation in society. It was also influenced by current theories which emphasise the interactive nature of reading (Dechant, 1991; McCormick, 1988; Rumelhart, 1985), by models of discourse comprehension (Graesser, Millis and Zwaan, 1997; Kintsch and van Dijk, 1978; Van Dijk and Kintsch, 1983), and by theories of performance in solving reading tasks (Kirsch and Mosenthal, 1990). The definition of variables operating within the reflection and evaluation sub-scale is, we believe, PISA's major new contribution to conceptualising reading. PISA is the first large-scale study to attempt to define variables operating within the reflection and evaluation sub-scale and to propose characterising populations from this perspective in its reporting.

The apparently salient variables in each scale are described briefly below. It is important to remember that the difficulty of any particular item is conditioned by the interaction among all the variables.
Task Variables for the Combined Reading Literacy Scale
The overall proficiency scale for reading literacy covers three broad aspects of reading: retrieving, interpreting and reflecting upon and evaluating information. The proficiency scale includes tasks associated with each of these.

The difficulty of any reading task depends on an interaction between several variables. The type of process involved in retrieving, interpreting, or reflecting on and evaluating information is salient in relation to difficulty. The complexity and sophistication of processes range from making simple connections between pieces of information, to categorising ideas according to given criteria, to critically evaluating a section of text. The difficulty of retrieval tasks is associated particularly with the number of pieces of information to be included in the response, the number of criteria that the information must meet, and whether what is retrieved needs to be sequenced in a particular way. For interpretative and reflective tasks, the length, complexity and amount of text that needs to be assimilated particularly affect
difficulty. The difficulty of items requiring reflection is also conditioned by the familiarity or specificity of the knowledge that must be drawn on from outside the text. For all aspects of reading, the tasks tend to be more difficult when there is no explicit mention of the ideas or information required to complete them, the required information is not prominent in the text, and much competing information is present.
Task Variables for the Retrieving Information Sub-Scale

Four important variables characterising the PISA retrieving information tasks were identified. They are outlined here and are incorporated in the proficiency scale description in Figure 32.
Type of retrieval

Type of retrieval is related to the type of match micro-aspect referred to in the reading framework for PISA (OECD, 1999a). Locating is the primary process used for tasks in this scale. Locating tasks require a reader to find and retrieve information in a text using criteria that are specified in a question or directive. The more criteria that need to be taken into account, the more difficult the task will tend to be. The reader may need to consult a text several times by cycling through it to find and retrieve different pieces of information. The more numerous the pieces of information to be retrieved, and whether or not they must be sequenced in a particular way, the greater the difficulty of the task.
Explicitness of information

Explicitness of information refers to the degree of literalness or explicitness in the task and in the text, and considers how much the reader must use inference to find the necessary information. The task becomes more difficult when the reader must infer what needs to be considered in looking for information, and when the information to be retrieved is not explicitly provided. This variable is also related to the type of match micro-aspect referred to in the reading framework for PISA. While retrieving information tasks generally require minimal inference, the more difficult tasks require some processing to match the requisite information with what is in the text, and to select relevant information through inference and prioritising.
Nature of competing information

Nature of competing information is similar to the plausibility of the distractors micro-aspect referred to in the reading framework for PISA. Information is competing when the reader may mistakenly retrieve it because it resembles the correct information in one or more ways. Within this variable, the prominence of the correct information also needs to be considered: it will be relatively easy to locate if it is in a heading, located near the beginning of the text, or repeated several times.
Nature of text

Nature of text refers to aspects such as text length and complexity. Longer and more complex texts are harder to negotiate than simpler and shorter texts, all other things being equal. Although nature of text is a variable in all scales, its characteristics as a variable and probably its importance differ in retrieving information and in the other two scales. This is probably because retrieving information tends to focus more on smaller details of text than interpreting texts and reflection and evaluation do. The reader therefore needs to take comparatively little account of the overall context (the text) in which the information is located, especially if the text is well structured and if the question or directive refers to it.
Task Variables for the Interpreting Texts Sub-Scale

A similar set of variables to those for retrieving information was identified for the PISA interpreting texts tasks. In some respects, the variables behave differently in relation to the interpreting texts sub-scale, as outlined here and incorporated in the scale description in Figure 33.
Type of interpretation

Type of interpretation is related to both the type of match and the type of information micro-aspects referred to in the reading framework for PISA. Several processes have been identified that form a hierarchy of complexity, and tend to be associated with increasing difficulty. At the easier end, readers need to identify a theme or main idea. More difficult tasks require understanding relationships within the text that are an inherent part of its organisation and
meaning, such as cause and effect, problem and solution, goal and action, claim and evidence, motive and behaviour, precondition and action, explanation for an outcome, and time sequence. The most difficult tasks are of two kinds: construing meaning, which requires the reader to focus on a word, phrase or sentence in order to identify its particular effect in context, and may involve dealing with ambiguities or subtle nuances of meaning; and analogical reasoning, which requires comparing (finding similarities between parts of a text), contrasting (finding differences), or categorising ideas (identifying similarities and differences in order to classify information from within or relevant to the text). The latter may require either fitting examples into a category supplied in the text or finding textual examples of a given category.
Explicitness of information

Explicitness of information encompasses how much direction the reader is given in focusing the interpretation appropriately. The factors to be considered may be indicated explicitly in the item and readily matched to one or more parts of the text. For a more difficult task, readers may have to consider factors that are not stated either in the item or the text. Generally, interpreting tasks require more inference than those in the retrieving information sub-scale, but there is a wide range of demand within this variable in the interpreting texts sub-scale.
The difficulty of a task is also conditioned by the number of factors that need to be considered, and the number of pieces of information that the reader needs in order to respond to the item. The greater the number of factors to consider or pieces of information to be supplied, the more difficult the task becomes.
Nature of competing information

Nature of competing information functions similarly to the variable with the same name in the retrieving information sub-scale. For interpreting items, however, explicit information is likely to compete with the implicit information required for the task.
Nature of text

Nature of text is relevant to the interpreting texts sub-scale in several ways. The familiarity of the topic or theme of the text is significant. Texts with more familiar topics and more personal themes tend to be easier to assimilate than those with themes that are remote from the reader’s experience and that are more public or impersonal. Similarly, texts in a familiar form or genre are easier to manage than those in unfamiliar forms. The length and complexity of a text (or the part of it that needs to be considered) also play a role in the difficulty of a task. The more text the reader has to take into account, the more difficult a task is likely to be.
Task Variables for the Reflection and Evaluation Sub-Scale
Some of the variables identified as relevant to retrieving information and interpreting texts are also relevant to the PISA reflection and evaluation tasks, though they operate differently to some extent. The reflection and evaluation dimension, however, is influenced by a wider range of task variables than the other reading aspects, as itemised here and incorporated in the scale description in Figure 34.
Type of reflection

Five reflecting processes were identified. The reflection and evaluation sub-scale associates each type of reflection with a range of difficulties, but on average the types of reflection vary in complexity and tend to be associated with increasing difficulty.
Connecting is at the lowest level of reflection. A connection requires the reader to make a basic link between the text and knowledge from outside it. Connecting may involve demonstrating an understanding of a construct or concept underlying the text. It may be a matter of content, such as finding an example of a concept (e.g., ‘fact’ or ‘kindness’), or may involve demonstrating an understanding of the form or linguistic function of features of a text, such as recognising the purpose of conventional structures or linguistic features. For example, the task may require the reader to articulate the relationship between two parts of a text.
The following two types of reflection seem to be more difficult than connecting, but were found to be of similar difficulty to each other.
Explaining involves going beyond the text to give reasons for the presence or purpose of text-based information or features that are consistent with the evidence presented in the text. For example, a reader may be asked to explain why there is a question on an employment application form about how far a prospective employee lives from the workplace.
When comparing, the reader needs to find similarities (or differences) between something in the text and something from outside it. In items described as involving comparing, for example, readers may be asked to compare a character’s behaviour in a narrative with the behaviour of people they know, or to say whether their own attitude to life resembles that expressed in a poem. Comparisons become more demanding as the range of approaches from which the comparison can be made becomes more constricted and less explicitly stated, and as the knowledge to be drawn upon becomes less familiar and accessible.
The most difficult tasks involve either hypothesising or evaluating.
Hypothesising involves going beyond the text to offer explanations for text-based information or features that are consistent with the evidence presented in the text. This type of reflection could be termed simply ‘explaining’ if the context were very familiar to the reader, or if some supporting information were supplied in the text. However, hypothesising tends to be relatively demanding because the reader needs to synthesise several pieces of information from both inside and outside the text, and to generate plausible conclusions rather than to retrieve familiar knowledge and apply it to the text. The degree to which signals are given about what kind of knowledge needs to be drawn on, and the degree of familiarity or concreteness of the knowledge, contribute to the difficulty of hypothesising tasks.
Evaluating involves making a judgement about a text as a whole or about some part or feature of it. This kind of critical thinking requires the reader to call upon an internalised hierarchy of values (good/bad; appropriate/inappropriate; right/wrong). In PISA items, the process of evaluating is often related to textual structure, style, or internal coherence. The reader may need to consider critically the writer’s point of view or the text’s stylistic appropriateness, logical consistency, or thematic coherence.
Nature of reader’s knowledge

The nature of the reader’s knowledge brought to bear from outside the text, on a continuum from general (broad, diverse and commonly known) to specialised (narrow, specific and related to a particular domain), is important in reflecting tasks. The term ‘knowledge’ includes attitudes, beliefs and opinions as well as factual knowledge: any cognitive phenomenon that the reader may draw on or possess. The domain and reference point for the knowledge and experience on which the reader needs to draw to reflect upon the text contribute to the difficulty of the task. When the knowledge may be highly personal and subjective, the demand is lower because there is a very wide range of appropriate responses and no strong external standard against which to judge them. As the knowledge to be drawn upon becomes more specialised, less personal and more externally referenced, the range of appropriate responses narrows. For example, if knowledge about literary styles or information about public events is a necessary prerequisite for a sensible reflection, the task is likely to be more difficult than one that asks for a personal opinion about parental behaviour.
In addition, difficulty is affected by the extent to which the reader has to generate the terms of the reflection. In some cases, the item or the text fully defines what needs to be considered in the course of the reflection. In other cases, readers must infer or generate their own set of factors as the basis for a hypothesis or evaluation.
Nature of text

Nature of text resembles the variable described for the interpreting texts sub-scale, above. Reflecting items that are based on simple, short texts (or simple, short parts of longer texts) are easier than those based on longer, more complex texts. The degree of textual complexity is judged in terms of the text’s content, linguistic form, or structure.
Nature of understanding of the text

Nature of understanding of the text is close to the ‘type of match’ micro-process described in the reading framework. It encompasses the kind of processing the reader needs to engage in before reflecting on the text. If the reflection is to be based on locating a small section of explicitly indicated text (locating), the task will tend to be easier than if it is based on a contradiction in the text that must first be inferred (inferring a logical relationship). The type of understanding of the text may entail
a partial or a broad, general understanding of the main idea. Reflecting items become more difficult as the reflection requires fuller, deeper and richer textual understanding.

Retrieving Information Sub-Scale

The PISA 2000 retrieving information sub-scale is described in Figure 32.

Distinguishing features of tasks at each level:
In a typical task at this level, the reader:

Level 5: Tasks at this level require the reader to locate and possibly sequence or combine multiple pieces of deeply embedded information, some of which may be outside the main body of the text. Typically the content and form of the text are unfamiliar. The reader needs to make high-level inferences in order to determine which information in the text is relevant to the task. There is highly plausible and/or extensive competing information.
• Locates and combines two pieces of numeric information on a diagram showing the structure of the labour force. One of the relevant pieces of information is found in a footnote to the caption of the diagram. [Labour 3 R088Q03, score category 2]

Level 4: Tasks at this level require the reader to locate and possibly sequence or combine multiple pieces of embedded information. Each piece of information may need to meet multiple criteria. Typically the content and form of the text are unfamiliar. The reader may need to make inferences in order to determine which information in the text is relevant to the task.
• Marks the positions of two actors on a stage diagram by referring to a play script. The two pieces of information needed are embedded in a stage direction in a long and dense text. [Amanda 4 R216Q04]

Level 3: Tasks at this level require the reader to locate, and in some cases recognise the links between, pieces of information, each of which may be required to meet multiple criteria. Typically there is prominent competing information.
• Identifies the starting date of a graph showing the depth of a lake over several thousand years. The information is given in prose in the introduction, but the date is not explicitly marked on the graph’s scale. The first marked date on the scale constitutes strong competing information. [Chad 3A R040Q03A]
• Finds the relevant section of a magazine article for young people about DNA testing by matching with a term in the question. The precise information that is required is surrounded by a good deal of competing information, some of which appears to be contradictory. [Police 4 R100Q04]

Level 2: Tasks at this level require the reader to locate one or more pieces of information, which may need to meet multiple criteria. Typically some competing information is present.
• Finds explicitly stated information in a notice about an immunisation program in the workplace. The relevant information is near the beginning of the notice but embedded in a complex sentence. There is significant competing information. [Flu 2 R077Q02]

Level 1: Tasks at this level require the reader to locate one or more independent pieces of explicitly stated information. Typically there is a single criterion that the located information must meet, and there is little if any competing information in the text.
• Finds a literal match between a term in the question and the required information. The text is a long narrative; however, the reader is specifically directed to the relevant passage, which is near the beginning of the story. [Gift 6 R119Q06]
• Locates a single explicitly stated piece of information in a notice about job services. The required information is signalled by a heading in the text that literally matches a term in the question. [Personnel 1 R234Q01]

Below Level 1: Insufficient information to describe features of tasks at this level.

Figure 32: The Retrieving Information Sub-Scale
Interpreting Texts Sub-Scale

The PISA 2000 interpreting texts sub-scale is described in Figure 33.

Distinguishing features of tasks at each level:
In a typical task at this level, the reader:

Level 5: Tasks at this level typically require either construing of nuanced language or a full and detailed understanding of a text.
• Analyses the descriptions of several cases in order to determine the appropriate labour force status categories. Infers the criteria for assignment of each case from a full understanding of the structure and content of a tree diagram. Some of the relevant information is in footnotes and is therefore not prominent. [Labour 4 R088Q04, score category 2]

Level 4: Tasks at this level typically involve either understanding and applying categories in an unfamiliar context, or construing the meaning of sections of text by taking into account the text as a whole. These tasks require a high level of text-based inference. Competing information at this level is typically in the form of ambiguities, ideas that are contrary to expectation, or ideas that are negatively worded.
• Construes the meaning of a sentence in context by taking into account information across a large section of text. The sentence in isolation is ambiguous and there are apparently plausible alternative readings. [Gift 4 R119Q04]
Level 3: Tasks at this level require the reader to consider many elements of the text. The reader typically needs to integrate several parts of a text in order to identify a main idea, understand a relationship, or construe the meaning of a word or phrase. The reader may need to take many criteria into account when comparing, contrasting or categorising. Often the required information is not prominent but implicit in the text, and the reader may be distracted by competing information that is explicit in the text.
• Explains a character’s behaviour in a given situation by linking a chain of events and descriptions scattered through a long story. [Gift 8 R119Q08]
• Integrates information from two graphic displays to infer the concurrence of two events. The two graphics use different conventions, and the reader must interpret the structure of both graphics in order to translate the relevant information from one form to the other. [Chad 6 R040Q06]

Level 2: Some tasks require inferring the main idea in a text when the information is not prominent. Others require understanding relationships or construing meaning within a limited part of the text, by making low-level inferences. Tasks at this level may involve forming or applying simple categories.
• Uses information in a brief introduction to infer the main theme of an extract from a play script. [Amanda 1 R216Q01]

Level 1: Tasks at this level typically require inferring the main theme or author’s purpose in a text about a familiar topic when the idea is prominent, either because it is repeated or because it appears early in the text. There is little if any competing information in the text.
• Recognises the main theme of a magazine article for teenagers about sports shoes. The theme is implied in the sub-heading and repeated several times in the body of the article. [Runners 1 R110Q01]

Below Level 1: Insufficient information to describe features of tasks at this level.
Figure 33: The Interpreting Texts Sub-Scale
Reflection and Evaluation Sub-Scale

The PISA 2000 reflection and evaluation sub-scale is described in Figure 34.

Distinguishing features of tasks at each level:
In a typical task at this level, the reader:

Level 5: Tasks at this level require critical evaluation or hypothesis, and may draw on specialised knowledge. Typically these tasks require readers to deal with concepts that are contrary to expectations, and to draw on a deep understanding of long or complex texts.
• Hypothesises about an unexpected phenomenon: that an aid agency gives relatively low levels of support to a very poor country. Takes account of inconspicuous as well as more obvious information in a complex text on a relatively unfamiliar topic (foreign aid). [Plan International 4B R099Q4B, score category 2]
• Evaluates the appropriateness of an apparently contradictory section of a notice about an immunisation program in the workplace, taking into account the persuasive intent of the text and/or its logical coherence. [Flu 5 R077Q05]

Level 4: Tasks at this level typically require readers to critically evaluate a text, or hypothesise about information in the text, using formal or public knowledge. Readers must demonstrate an accurate understanding of the text, which may be long or complex.
• Evaluates the writer’s craft by comparing two short letters on the topic of graffiti. Readers need to draw on their understanding of what constitutes good style in writing. [Graffiti 6B R081Q06B]

Level 3: Tasks at this level may require connections, comparisons and explanations, or they may require the reader to evaluate a feature of the text. Some tasks require readers to demonstrate a detailed understanding of the text in relation to familiar, everyday knowledge. Other tasks do not require detailed text comprehension but require the reader to draw on less common knowledge. The reader may need to infer the factors to be considered.
• Connects his/her own concepts of compassion and cruelty with the behaviour of a character in a narrative, and identifies relevant evidence of such behaviour in the text. [Gift 9 R119Q09, score category 2]
• Evaluates the form of a text in relation to its purpose. The text is a tree diagram showing the structure of the labour force. [Labour 7 R088Q07]

Level 2: Some tasks at this level require readers to make connections or comparisons between the text and outside knowledge. For others, readers need to draw on personal experience and attitudes to explain a feature of the text. The tasks require a broad understanding of the text.
• Compares claims made in two short texts (letters about graffiti) with his/her own views and attitudes. Broad understanding of at least one of the two letter writers’ opinions needs to be demonstrated. [Graffiti 6A R081Q06A]

Level 1: Tasks at this level require readers to make a simple connection between information in the text and common, everyday knowledge. The reader is explicitly directed to consider relevant factors in the task and in the text.
• Makes a connection by articulating the relationship between two parts of a single, specified sentence in a magazine article for young people about sports shoes. [Runners 6 R110Q06]

Below Level 1: Insufficient information to describe features of tasks at this level.
Figure 34: The Reflection and Evaluation Sub-Scale
Combined Reading Literacy Scale
The PISA 2000 combined reading literacy scale is described in Figure 35.
Distinguishing features of tasks at each level:
Level 5: The reader must: sequence or combine several pieces of deeply embedded information, possibly drawing on information from outside the main body of the text; construe the meaning of linguistic nuances in a section of text; or make evaluative judgements or hypotheses, drawing on specialised knowledge. The reader is generally required to demonstrate a full, detailed understanding of a dense, complex or unfamiliar text (in content or form), or one that involves concepts that are contrary to expectations. The reader will often have to make inferences to determine which information in the text is relevant, and to deal with prominent or extensive competing information.

Level 4: The reader must: locate, sequence or combine several pieces of embedded information; infer the meaning of a section of text by considering the text as a whole; understand and apply categories in an unfamiliar context; or hypothesise about or critically evaluate a text, using formal or public knowledge. The reader must draw on an accurate understanding of long or complex texts in which competing information may take the form of ideas that are ambiguous, contrary to expectation, or negatively worded.

Level 3: The reader must: recognise the links between pieces of information that have to meet multiple criteria; integrate several parts of a text to identify a main idea, understand a relationship or construe the meaning of a word or phrase; make connections and comparisons; or explain or evaluate a textual feature. The reader must take into account many features when comparing, contrasting or categorising. Often the required information is not prominent but implicit in the text, or obscured by similar information.

Level 2: The reader must: locate one or more pieces of information that may be needed to meet multiple criteria; identify the main idea, understand relationships or construe meaning within a limited part of the text by making low-level inferences; form or apply simple categories to explain something in a text by drawing on personal experience and attitudes; or make connections or comparisons between the text and everyday outside knowledge. The reader must often deal with competing information.

Level 1: The reader must: locate one or more independent pieces of explicitly stated information according to a single criterion; identify the main theme or author’s purpose in a text about a familiar topic; or make a simple connection between information in the text and common, everyday knowledge. Typically, the requisite information is prominent and there is little, if any, competing information. The reader is explicitly directed to consider relevant factors in the task and in the text.

Below Level 1: There is insufficient information to describe features of tasks at this level.
Figure 35: The Combined Reading Literacy Scale
Cut-off Points for Reading
The definition of proficiency levels for reading is based on a bandwidth of 0.8 logits and an RP level of 0.62 (see ‘Defining Proficiency Levels’ earlier in this chapter).
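The RP convention can be made concrete with a short sketch. Under the Rasch model used in PISA scaling, an RP level of 0.62 places a boundary at the ability at which a student has a 62 per cent chance of succeeding on an item of a given difficulty. The function names below are illustrative, not part of the PISA analysis software; only the logistic form and the 0.62 convention come from the text.

```python
import math

# Hedged sketch of the RP = 0.62 convention. Under a Rasch model the
# probability that a student of ability theta answers an item of
# difficulty delta correctly (both in logits) is logistic in
# (theta - delta). Names here are illustrative, not PISA code.
def rasch_p(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def rp_ability(delta, rp=0.62):
    # Invert the logistic: theta = delta + ln(rp / (1 - rp)).
    return delta + math.log(rp / (1.0 - rp))

# With RP = 0.62 a student "reaches" an item when about 0.49 logits
# above its difficulty, rather than at the 50/50 point of RP = 0.50.
offset = rp_ability(0.0)
```

The offset of roughly half a logit is what distinguishes an RP = 0.62 mapping from the conventional 50 per cent mastery point.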
Using these criteria, Figure 36 gives the level boundaries for the three reading aspect scales and the combined reading literacy scale, based on the PISA scale with a mean of 500 and a standard deviation of 100 (see Chapter 13).
Boundary Cut-Off Point on PISA Scale
Level 4 / Level 5 625.6
Level 3 / Level 4 552.9
Level 2 / Level 3 480.2
Level 1 / Level 2 407.5
Below Level 1 / Level 1 334.8
Figure 36: Combined Reading Literacy Scale Level Cut-Off Points
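The structure of Figure 36 can be sketched in a few lines. The cut-off values come straight from the figure; note that consecutive boundaries are a constant 72.7 PISA points apart, reflecting the fixed 0.8-logit bandwidth mapped linearly onto the reporting scale. The level-lookup helper is a hypothetical illustration, not the operational PISA code.

```python
from bisect import bisect_right

# Cut-off points from Figure 36 (combined reading literacy scale).
CUTOFFS = [334.8, 407.5, 480.2, 552.9, 625.6]

def reading_level(score):
    """Return 0 for Below Level 1, otherwise the level (1-5).

    A score at or above a boundary falls into the higher level.
    Illustrative helper only; not the operational PISA software.
    """
    return bisect_right(CUTOFFS, score)

# The equal spacing reflects the constant 0.8-logit bandwidth mapped
# onto the PISA reporting scale (mean 500, standard deviation 100).
step = CUTOFFS[1] - CUTOFFS[0]   # 72.7 scale points per band
```

For example, a score of 500 falls between 480.2 and 552.9 and is therefore Level 3.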
Mathematical Literacy
What is Being Assessed?
PISA’s mathematical literacy tasks assess students’ ability to recognise and interpret problems encountered in their world; to translate the problems into a mathematical context; to use mathematical knowledge and procedures from various areas to solve a problem within its mathematical context; to interpret results in terms of the original problem; and to reflect upon the methods applied and communicate the outcomes.
In the PISA mathematical literacy scale, the key elements defining the increasing difficulty of tasks at successive levels are the number and complexity of processing or computation steps; connection and integration demands (from use of separate elements, to integration of different pieces of information, different representations, or different mathematical tools or knowledge); and representation, modelling, interpretation and reflection demands (from recognising and using a familiar, well-formulated model, to formulating, translating or creating a useful model within an unfamiliar context, and using insight, reasoning, argumentation and generalisation).
The factors that influence item difficulty, and therefore contribute to a definition of mathematical proficiency, include the following:

• the kind and degree of interpretation and reflection required, which includes the nature of demands arising from the context of the problem; the extent to which the mathematical demands of the problem are apparent or to which students must impose their own mathematical construction; and the extent to which insight, complex reasoning and generalisation are required; and

• the kind and level of mathematical skill required, which ranges from single-step problems requiring students to reproduce basic mathematical facts and simple computation processes, to multi-step problems involving more advanced mathematical knowledge and complex decision-making, information-processing, and problem-solving and modelling skills.

A single scale was proposed for PISA mathematical literacy. Five proficiency levels can be defined, although an initial set of descriptions of three levels (lowest, middle and highest) was developed for use in PISA 2000.
At the lowest level of proficiency that is described, students typically negotiate single-step processes that involve recognising familiar contexts and mathematically well-formulated problems, reproducing well-known mathematical facts or processes, and applying simple computational skills.
At higher levels of proficiency, students typically carry out more complex tasks involving more than a single processing step, and combine different pieces of information or interpret different representations of mathematical concepts or information, recognising which elements are relevant and important. To identify solutions, they typically work with given mathematical models or formulations, which are frequently in algebraic form, or carry out a small sequence of processing or a number of calculation steps to produce a solution.
At the highest level of proficiency that is described, students take a more creative and active role in their approach to solving mathematical problems. They typically interpret more complex information and negotiate a number of processing steps. They typically produce a formulation of a problem and often develop a suitable model that facilitates solving it. Students at this level typically identify and apply relevant tools and knowledge, frequently in an unfamiliar context; demonstrate insight in identifying a suitable solution strategy; and display other higher-order cognitive processes, such as generalisation, reasoning and argumentation, to explain or communicate results.
Given that the amount of data available from the PISA 2000 study is limited because mathematics was a minor test domain, it was decided that it would be unwise to establish cut-off points between the five levels.
The PISA 2000 mathematical literacy scale is described in Figure 37.
Figure 37: The Mathematical Literacy Scale

[The figure positions the three described levels along the PISA scale, whose axis runs from 300 to 800.]

Highest Level Described
• Show insight in the solution of problems;
• Develop a mathematical interpretation and formulation of problems set in a real-world context;
• Identify relevant mathematical tools or methods for solution of problems in unfamiliar contexts;
• Solve problems involving several steps;
• Reflect on results and generalise findings; and
• Use reasoning and mathematical argument to explain solutions and communicate outcomes.

Middle Level Described
• Interpret, link and integrate different information in order to solve a problem;
• Work with and connect different mathematical representations of a problem;
• Use and manipulate given mathematical models to solve a problem;
• Use symbolic language to solve problems; and
• Solve problems involving a small number of steps.

Lowest Level Described
• Recognise familiar elements in a problem, and recall knowledge relevant to the problem;
• Reproduce known facts or procedures to solve a problem;
• Apply mathematical knowledge to solve problems that are simply expressed and are either already formulated in mathematical terms, or where the mathematical formulation is straightforward; and
• Solve problems involving only one or two steps.
Illustrating the PISA Mathematical Literacy Scale
The items from two different units (Apples and Speed of Racing Car) have been selected to illustrate the PISA mathematical literacy scale. Figure 38 shows where each item is located on the scale, and summarises the item to show how it illustrates the scale.
Item (PISA scale location) and comment:

Apples M136Q03 (732): Students are given a hypothetical scenario involving planting an orchard of apple trees in a square pattern, with a ‘row’ of protective conifer trees around the square. This item requires students to show insight into mathematical functions by comparing the growth of a linear function with that of a quadratic function. Students are required to construct a verbal description of a generalised pattern, and to create an argument using algebra. Students need to understand both the algebraic expressions used to describe the pattern and the underlying functional relationships in such a way that they can see and explain the generalisation of these relationships in an unfamiliar context. A chain of reasoning, and communication of this reasoning in a written explanation, is required.

Apples M136Q02 (664): Students are given two algebraic expressions that describe the growth in the number of trees (apple trees and the protective conifers) as the orchard increases in size, and are asked to find a value for which the two expressions coincide. This item requires students to interpret expressions containing words and symbols, linking different representations (pictorial, verbal and algebraic) of two relationships (one quadratic and one linear). Students have to find a strategy for determining when the two functions will have the same solution (for example, by trial and error, or by algebraic means), and to communicate the result by explaining the reasoning and calculation steps involved.

Racing Car M159Q05 (664): Students are given a graph showing the speed of a racing car as it progresses at various distances along a race track. For this item, they are also given a set of plans of hypothetical tracks and asked to identify which could have given rise to the graph. The item requires students to understand and interpret a graphical representation of a physical relationship (speed and distance of a car) and relate it to the physical world. Students need to link and integrate two very different visual representations of the progress of a car around a race track, and to identify and select the correct option from among challenging alternatives.

Apples M136Q01 (557): Students are given a hypothetical scenario involving planting an orchard of apple trees in a square pattern, with a ‘row’ of protective conifer trees around the square. They are asked to complete a table of values generated by the functions that describe the number of trees as the size of the orchard is increased. This item requires students to interpret a written description of a problem situation, to link this to a tabular representation of some of the information, and to recognise and extend a pattern. Students need to work with the given models in order to relate two different representations (pictorial and tabular) of two relationships (one quadratic and one linear) to extend the pattern.
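The Apples unit lends itself to a small worked illustration. In the released unit, an orchard of size n contains n² apple trees and 8n protective conifers; these exact formulas are recalled from the published item and should be treated as an assumption of this sketch, which writes out the trial-and-error strategy that M136Q02 credits.

```python
# Illustrative sketch only, not PISA analysis code. The released
# Apples unit models an orchard of size n as n**2 apple trees
# surrounded by 8*n protective conifers; treat these formulas as an
# assumption recalled from the published item.
def apple_trees(n):
    return n ** 2          # quadratic growth

def conifers(n):
    return 8 * n           # linear growth

# Trial and error, one strategy the item credits: find the positive
# orchard size at which the two expressions coincide.
match = next(n for n in range(1, 100) if apple_trees(n) == conifers(n))
# Beyond this size the quadratic term dominates the linear one,
# which is the insight the harder item (M136Q03) asks students to
# describe and justify.
```

Solving n² = 8n algebraically gives the same answer, n = 8, which is the point the quadratic relationship overtakes the linear one.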
Scientific Literacy
What is Being Assessed?
PISA’s scientific literacy tasks assess students’ ability to use scientific knowledge (understanding scientific concepts); to recognise scientific questions and to identify what is involved in scientific investigations (understanding the nature of scientific investigation); to relate scientific data to claims and conclusions (using scientific evidence); and to communicate these aspects of science.
In the PISA scientific literacy scale, the key elements defining the increasing difficulty of tasks at successive levels are the complexity of the concepts encountered, the amount of information given, the type of reasoning required, and the context and sometimes the format and presentation of the items. There is a progression of difficulty across tasks in the use of scientific knowledge, involving:

• the recall of simple scientific knowledge, common scientific knowledge or given questions, variables or data;

• scientific concepts or questions and details of investigations; and

• elaborated scientific concepts or additional information, or a chain of reasoning; through to simple scientific conceptual models, or analysis of investigations or evidence for alternative perspectives.

The scientific proficiency scale is based on a definition of scientific literacy linked to four processes, as formulated in the PISA science framework (OECD, 1999a):

• Understanding scientific concepts concerns the ability to make use of scientific knowledge and to demonstrate an understanding of scientific concepts by applying scientific ideas, information, or appropriate concepts (not given in the stimulus or test item) to a given situation. This may involve explaining relationships, scientific events or phenomena, or possible causes for the changes that are indicated, or making predictions about the effects of given changes, or identifying the factors that would influence a given outcome.
Racing Car (M159Q01), 502: Students are given a graph showing the speed of a racing car as it progresses at various distances along a race track. For this item they are asked to interpret the graph to find a distance that satisfies a given condition. The item requires students to interpret a graphical representation of a physical relationship (distance and speed of a car travelling on a track of unknown shape). Students need to interpret the graph by linking a verbal description with two particular features of the graph (one simple and straightforward, and one requiring a deeper understanding of several elements of the graph and what it represents), then identify and read the required information from the graph, selecting the best option from the given alternatives.

Racing Car (M159Q03), 423: In this item, students are asked to interpret the speed of the car at a particular point in the graph. The item requires students to read information from a graph representing a physical relationship (speed and distance of a car). Students need to identify the place in the graph referred to in a verbal description, recognise what is happening to the speed of the vehicle at that point, then select the best matching option from among given alternatives.

Racing Car (M159Q02), 412: This item asks students to read from the graph a single value that satisfies a simple condition. The item requires students to read information from a graph representing a physical relationship (speed and distance of a car). Students need to identify one specified feature of the graph (the display of speed), read directly from the graph a value that minimises the feature, then select the best match from among given alternatives.
Figure 38: Typical Mathematical Literacy Tasks
Figure 39: The Scientific Literacy Scale
Highest Level Described
• Create or use simple conceptual models to make predictions or give explanations;
• Analyse scientific investigations in relation to, for example, experimental design, identification of the idea being tested;
• Relate data as evidence to evaluate alternative viewpoints or different perspectives; and
• Communicate scientific arguments and/or descriptions in detail and with precision.

Middle Level Described
• Use scientific concepts in making predictions or giving explanations;
• Recognise questions that can be answered by scientific investigation and/or identify details of what is involved in a scientific investigation; and
• Select relevant information from competing data or chains of reasoning in drawing or evaluating conclusions.

Lowest Level Described
• Recall simple scientific factual knowledge (e.g., names, facts, terminology, simple rules); and
• Use common knowledge in drawing or evaluating conclusions.
• Understanding the nature of scientific investigation concerns the ability to recognise questions that can be scientifically investigated and to be aware of what is involved in such investigations. This may involve recognising questions that could be or are being answered in an investigation, identifying the question or idea that was being (or could have been) tested in a given investigation, distinguishing questions that can and cannot be answered by scientific investigation, or suggesting a question that could be investigated scientifically in a given situation. It may also involve identifying evidence/data needed to test an explanation or explore an issue, requiring the identification or recognition of variables that should be compared, changed or controlled, additional information that would be needed, or action that should be taken so that relevant data can be collected.
• Using scientific evidence concerns the ability to make sense of scientific data as evidence for claims or conclusions. This may involve producing a conclusion from given scientific evidence/data or selecting the conclusion that fits the data from alternatives. It may also involve giving reasons for or against a given conclusion in terms of the data provided, or identifying the assumptions made in reaching a conclusion.
• Communicating scientific descriptions or arguments concerns the ability to communicate scientific descriptions, arguments and explanations to others. This involves communicating valid conclusions from available evidence/data to a specified audience. It may involve producing an argument or an explanation based on the situation and data given, or providing relevant additional information, expressed appropriately and clearly for the audience.

Developing the proficiency scale for science involved an iterative process that considered the theoretical basis of the science framework and the item analyses from the main study.
Consideration of the framework suggested a scale of scientific literacy which, at its lower levels, would involve being able to grasp easier scientific concepts in familiar situations, such as recognising questions that can or cannot be decided by scientific investigation, recalling scientific facts, or using common scientific knowledge to draw conclusions.
Students at higher levels of proficiency would be able to apply more cognitively demanding concepts in more complex and occasionally unfamiliar contexts. This could involve creating models to make predictions, using data as evidence to evaluate different perspectives, or communicating scientific arguments with precision.
The difficulty levels indicated by the analysis of the science items in PISA 2000 also gave useful information, which was considered along with the theoretical framework in developing the scale descriptions.
Only limited data were available from the PISA 2000 study, where science was a minor test domain. For this reason, a strict definition of levels could not be confidently recommended. The precise definition of levels and investigation of sub-scales will be possible for PISA 2006, using the data from science as a major test domain.
The PISA 2000 scientific literacy scale is described in Figure 39.
Illustrating the PISA Scientific Literacy Scale
Several science items have been recommended for public release on the basis of the following criteria. In addition to indicating the assessment style used in PISA, these items provide a guide to the descriptive scientific proficiency scales. The selected items:
• exemplify the PISA philosophy of assessing the students' preparedness for future life rather than simply testing curriculum-based knowledge;
• cover a range of proficiency levels;
• give consistent, reliable item analyses and were not flagged as causing problems in individual countries;
• represent several formats: multiple-choice, short constructed-response, open response; and
• include some partial credit items.
The Semmelweis and Ozone units, containing a total of 10 score points, were released after PISA 2000. Characteristics of these units are summarised in Figures 40 and 41.
Semmelweis

The historical example about Semmelweis' discoveries concerning puerperal fever was chosen for several reasons:
• It illustrates the importance of systematic data collection for explaining a devastating phenomenon, the death of many women in maternity wards;
• It demonstrates the need to consider alternative explanations based on the data collected and on additional observations regarding the behaviour of the medical personnel;
• It shows how scientific reasoning based on evidence can refute popular beliefs that prevent a problem from being addressed effectively; and
• It allows students to expand on the example by bringing in their own scientific knowledge.
Unit Name    Unit Identification    Q#    Score    Location on PISA Scale    Scientific Process
Semmelweis   195                    2     2        679                       c
Semmelweis   195                    2     1        651                       c
Semmelweis   195                    4     1        506                       b
Semmelweis   195                    5     1        480                       a
Semmelweis   195                    6     1        521                       a

Note: The a, b, and c in the right-hand column refer to the scientific processes: a = understanding scientific concepts; b = understanding the nature of scientific investigation; and c = using scientific evidence.

Figure 40: Characteristics of the Science Unit, Semmelweis

Sample Item S195Q02 (open constructed-response)

Sample Item S195Q02 is the first in a unit that introduces students to Semmelweis, who took up a position in a Viennese hospital in the 1840s and became puzzled by the remarkably high death rate due to puerperal fever in one of the maternity wards. This information is presented in text and graphs, and the suggestion is then made that puerperal fever may be caused by extraterrestrial influences or natural disasters—not uncommon thinking in Semmelweis' time. Semmelweis tried on several occasions to convince his colleagues to consider more rational explanations. Students are invited to imagine themselves in his position, and to use the data he collected to defend the idea that earthquakes are an unlikely cause of the disease. The graphs show a similar variation in death rate over time, and the first ward consistently has a higher death rate than the second ward. If earthquakes were the cause, death rates in both wards should be about equal. The graphs suggest that something having to do with the wards themselves would explain the difference.

To obtain full credit, students needed to refer to the idea that death rates in both wards should have been similar over time if earthquakes were the cause. Some students came up with answers that did not refer to Semmelweis' findings, but to a characteristic of earthquakes that made it very unlikely that they were the cause, such as that they are infrequent while the fever is constantly present. Others came up with very original, well-taken statements such as 'if it were earthquakes, why do only women get the disease, and not men?' or 'if so, women outside the wards would also get that fever'. Although it could be argued that these students did not consider the data collected by Semmelweis, as the item asks, they were given partial credit because they demonstrated a certain ability to use scientific facts in drawing a conclusion.

This partial credit item is a good example of how one item can be used to exemplify two parts of the proficiency scale. The first score point is a moderately difficult item because it involves using scientific evidence to relate data systematically to possible conclusions using a chain of reasoning that is not given to the students. The second score point is at a higher level of proficiency because the students have to relate the data given as evidence to evaluate different perspectives.

Sample Item S195Q04 (multiple-choice)

Semmelweis gets an idea from his observations. Sample Item S195Q04 asks students to identify which from a set of ideas was most relevant to reducing the incidence of puerperal fever. Students need to put two pieces of relevant information from the text together: the behaviour of a medical student and the death of Semmelweis' friend from puerperal fever after dissection. This item exemplifies a low to moderate level of proficiency because it asks students to refer to given data or information to draw their conclusion.
Sample Item S195Q05 (open constructed-response)
Nowadays, most people are well aware that germs cause many diseases, and can be killed by heat. Perhaps not many people realise that routine procedures in hospitals make use of this to reduce risks of fever outbreak.

This item invites the student to apply the common scientific knowledge that heat kills bacteria to explain why these procedures are effective. This item is an example of a level of low to moderate difficulty.
Sample Item S195Q06 (multiple-choice)
Sample Item S195Q06 goes beyond the historical example to elicit common scientific knowledge needed to explain a scientific phenomenon. Students are asked to explain why antibiotics have become less effective over time. To respond correctly, they must know that frequent, extended use of antibiotics builds up strains of bacteria resistant to the antibiotics' initially lethal effects.

This item is located at a moderate level on the proficiency scale because it asks students to use scientific concepts (as opposed to common scientific knowledge, which is at a lower level) to give explanations.
Ozone

The subject of the Ozone unit is the effect on the environment of the depletion of the ozone layer, and is very important as it concerns human survival. It has received much media attention. The unit is significant, too, because it points to the successful intervention by the international community at a meeting in Montreal, an early example of successful global consultation on matters that extend beyond national boundaries.
The text presented to the students sets the context by explaining how ozone molecules and the ozone layer are formed, and how the layer forms a protective barrier by absorbing ultra-violet radiation from the sun's rays. Both good and bad effects of ozone are described. The context of the items involves genuine, important issues, and gives the opportunity to ask questions about physical, chemical and biological aspects of science.
Sample Item S253Q01 (open constructed-response)

In the first Ozone sample item, the formation of ozone molecules is presented in a novel way. A comic strip shows three stages of oxygen molecules being split under the sun's influence and then recombined into ozone molecules. Students need to interpret this information and communicate it to a person with limited scientific knowledge. The intention is to assess students' ability to communicate their own interpretation, which requires an open-ended response format.
Many different responses can be given, but to gain full credit, students must describe what is happening in at least two of the three stages of the comic strip.
Unit Name    Unit Identification    Q#    Score    Location on PISA Scale    Scientific Process
Ozone        253                    1     2        695                       d
Ozone        253                    1     1        641                       d
Ozone        253                    2     1        655                       c
Ozone        253                    5     1        560                       a
Ozone        270                    3     1        542                       b

Note: The a, b, c, and d in the right-hand column refer to the scientific processes: a = understanding scientific concepts; b = understanding the nature of scientific investigation; c = using scientific evidence; and d = communicating scientific descriptions or arguments.

Figure 41: Characteristics of the Science Unit, Ozone
Students who received partial credit for this item are regarded as working at a moderate to high level of proficiency because they could communicate a simple scientific description. Full credit requires more precision and detail; this is an example of the highest proficiency level.
Sample Item S253Q02 (multiple-choice)
Sample Item S253Q02 requires students to link part of the text to their own experience of weather conditions (thunderstorms occurring relatively close to Earth) in order to draw a conclusion about the nature of the ozone produced ('good' or 'bad'). The item requires drawing an inference or going beyond the stated information.

This is an example of an item at a moderate to high level of proficiency because students must relate data systematically to statements of possible conclusions using a chain of reasoning that is not specifically given to them.
Sample Item S253Q05 (open constructed-response)
In Sample Item S253Q05, students need to demonstrate specific knowledge of a possible consequence for human health of the depletion of the ozone layer. The Marking Guide is precise in demanding a particular type of cancer (skin cancer). This item is regarded as requiring a moderate level of proficiency because students must use a specific scientific concept in answering it.
Sample Item S270Q03 (complex multiple-choice)
The final Ozone sample item emphasises the international significance of scientific research in helping to solve environmental problems, such as the ones raised at the Montreal meeting. The item requires students to recognise questions that can be answered by scientific investigation and is an example of a moderate level of proficiency.
Chapter 17

Constructing and Validating the Questionnaire Indices
Wolfram Schulz
In several cases, the raw data from the context questionnaires were recoded or scaled to produce new variables, referred to as indices. Indices based on direct recoding of responses to one or more variables are called simple indices. Many PISA indices, however, have been constructed by applying a scaling methodology, and are referred to as complex indices. They summarise student or school representatives' (typically principals') responses to a series of related questions selected from larger constructs on the basis of theoretical considerations and previous research. Structural equation modelling was used to confirm the theoretically expected behaviour of the indices and to validate their comparability across countries. For this purpose, a model was estimated separately for each country and, collectively, for all OECD countries.1
This chapter describes the construction of all indices, both simple and complex. It presents the results of a cross-country validation of the student and school indices, and provides descriptive statistics for selected questionnaire variables. The analysis was done using Structural Equation Modelling (SEM) for a Confirmatory Factor Analysis (CFA) of questionnaire items.
CFA was used to validate the indices, and item response theory techniques were used to produce scale scores. An index involving multiple questions and student responses was scaled using the Rasch item response model, and the scale score is a weighted maximum likelihood estimate, referred to as a Warm estimator (Warm, 1985) (also see Chapter 9). Dichotomous response categories (i.e., yes/no) produce only one parameter estimate (labelled 'Delta' in the figures reporting item parameters in this chapter). Rating scales allowing graded responses (i.e., never, some lessons, most lessons, every lesson) produce both a parameter estimate and an estimate of the thresholds between n categories, labelled 'Tau(1)', 'Tau(2)', etc., in the figures reporting item parameters in this chapter.

1 The Netherlands was not included because of its low participation rate.
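For reference, the scaling model sketched above can be written out explicitly. The dichotomous case is the standard Rasch model, and the graded case is the usual rating-scale extension whose category thresholds correspond to the 'Tau' parameters; this is a standard textbook formulation, not a quotation from the PISA documentation:

```latex
\[
P(X_{i}=1\mid\theta)=\frac{\exp(\theta-\delta_{i})}{1+\exp(\theta-\delta_{i})}
\]
% and, for an item with ordered categories k = 0, ..., m,
\[
P(X_{i}=k\mid\theta)=
\frac{\exp\sum_{j=0}^{k}(\theta-\delta_{i}-\tau_{ij})}
     {\sum_{h=0}^{m}\exp\sum_{j=0}^{h}(\theta-\delta_{i}-\tau_{ij})},
\qquad \tau_{i0}\equiv 0,
\]
```

where θ is the latent trait, δ_i is the item parameter ('Delta') and the τ_ij are the thresholds ('Tau(1)', 'Tau(2)', etc.).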
The scaling proceeded in three steps:
• Item parameters were estimated from calibration samples of students and schools. The student calibration sample consisted of sub-samples of 500 randomly selected students, and the school calibration sample consisted of sub-samples of 99 schools, from each participating OECD country.2 Non-OECD countries did not contribute to the computation of the item parameters.
• Estimates were computed for all students and all schools by anchoring the item parameters obtained in the preceding step.
• Indices were then standardised so that the mean of the index value for the OECD student population was zero and the standard deviation was one (countries being given equal weight in the standardisation process).
This item response theory approach was adopted for producing index 'scores', after validation of the indices with CFA procedures, because (i) it was consistent with the intention that the items simply be added together to produce a single index, and (ii) it provided an elegant way of dealing with missing item-response data.

2 For the international options, the item response theory indices were transformed using the same procedures as those applied to the other indices, but only OECD countries that had participated in the international options were included. For Belgium, only the Flemish Community chose to include the international options, and it was given equal weight and included. For the United Kingdom, where only Scotland participated, the data were not included in the standardisation. The rationale was that in Belgium the participating region represented half of the population, whereas students in Scotland are only a small part of the student population in the United Kingdom.
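The standardisation step can be sketched as follows. This is a simplified illustration with a hypothetical data layout (a mapping from country code to raw index scores); the actual PISA computation also involved sampling weights, which are omitted here. Equal country weighting is implemented by averaging per-country moments rather than pooling all students directly.

```python
import math

def standardise(scores_by_country):
    """scores_by_country: dict mapping country code -> list of raw index
    scores. Returns the same structure with standardised scores such that,
    with each country given equal weight, the mean is 0 and the SD is 1."""
    # Equal-weight grand mean: average of the per-country means.
    means = {c: sum(v) / len(v) for c, v in scores_by_country.items()}
    grand_mean = sum(means.values()) / len(means)
    # Equal-weight variance: average of per-country mean squared
    # deviations about the grand mean.
    msds = {c: sum((x - grand_mean) ** 2 for x in v) / len(v)
            for c, v in scores_by_country.items()}
    sd = math.sqrt(sum(msds.values()) / len(msds))
    return {c: [(x - grand_mean) / sd for x in v]
            for c, v in scores_by_country.items()}
```

After this transformation, the equal-weight average of the country means is zero and the equal-weight average squared deviation is one, mirroring the description above.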
Validation Procedures of Indices
Structural Equation Modelling (SEM) was used to confirm theoretically expected dimensions or to re-specify the dimensional structure. SEM takes the measurement error associated with the indicators into account and therefore provides more reliable estimates for the latent variables than classical psychometric methods like exploratory factor analyses or simple computations of alpha reliabilities.
Basically, in CFA with SEM, an expected covariance matrix is fitted to a theoretical factor structure by minimising the differences between the expected covariance matrix (Σ, based upon a given model) and the observed covariance matrix (S, computed from the data).

Measures for the overall fit of a model are then obtained by comparing the expected Σ matrix with the observed S matrix. If the differences are close to zero, then the model fits the data; if they are rather large, the model does not fit the data and some re-specification may be necessary or, if this is not possible, the theoretical model should be rejected. Generally, model fit indices are approximations and need to be interpreted with care. In particular, the chi-squared test statistic for the null hypothesis Σ = S becomes a rather poor fit measure with larger sample sizes, because even small differences between the matrices are flagged as significant deviations.
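Concretely, under maximum likelihood estimation with p observed variables, the discrepancy being minimised is the standard ML fit function; this is given here for reference rather than quoted from the report:

```latex
\[
F_{\mathrm{ML}}(\theta)
  = \log\lvert\Sigma(\theta)\rvert
  + \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr)
  - \log\lvert S\rvert - p ,
\]
```

which equals zero exactly when the model-implied matrix Σ(θ) reproduces the observed matrix S.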
Though the assessment of model fit was based on a variety of measures, this report presents only the following indices to illustrate the comparative model fits:
• The Root Mean Square Error of Approximation (RMSEA) measures the discrepancy per degree of freedom for the model (Browne and Cudeck, 1993). Values of 0.05 or less indicate a close fit, values of 0.08 or more indicate a reasonable to large error of approximation, and values greater than 0.10 should lead to the rejection of a model.
• The Adjusted Goodness-of-Fit Index (AGFI) indicates the amount of variance in S explained by Σ. Values close to 1.0 indicate a good model fit.
• The difference between a specified model and a null model is measured by the Normed Fit Index (NFI), Non-Normed Fit Index (NNFI) and Comparative Fit Index (CFI). Like the AGFI, values close to 1.0 for these three indices show a good model fit.
Additionally, the explained variance (or item reliability) in the manifest variables was taken as an indicator of model fit; i.e., if the latent variables hardly explain any variance in the manifest variables, the assumed factor structure is not confirmed even if the overall model fit is reasonable.
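The RMSEA mentioned above is conventionally computed from the model chi-squared statistic; this is the standard Browne and Cudeck formula, included here for reference:

```latex
\[
\mathrm{RMSEA}
  = \sqrt{\max\!\left(\frac{\chi^{2}-df}{df\,(N-1)},\,0\right)} ,
\]
```

where df is the model's degrees of freedom and N the sample size; the max(·, 0) truncation prevents negative values when the chi-squared statistic falls below its degrees of freedom.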
Model estimation was done with LISREL (Jöreskog and Sörbom, 1993) using the STREAMS (Gustafsson and Stahl, 2000) interface program, which gives the researcher the possibility of estimating models for groups of data (e.g., different country data sets) and comparing the model fits across countries without needing to re-estimate the same model country-by-country. Model estimations were obtained using variance/covariance matrices.3
For each of the indices, the assumed model was estimated separately for each country and for the entire sample. In a few cases, the model had to be re-specified; in some cases, where this was unsuccessful, only the results for the misfitting model are reported.
For the student data, the cross-country validation was based on the calibration sample.
3 For ordinal variables, Jöreskog and Sörbom (1993) recommend using Weighted Least Squares (WLS) estimation with polychoric correlation matrices and the corresponding asymptotic covariance weight matrices instead of ML estimation with covariance matrices. As this procedure becomes computationally difficult with large amounts of data, the use of covariance matrices was chosen for the analysis of dimensionality reported here.
Here, the sample size was deemed appropriate for this kind of evaluation. For the school data, the complete school data set for each country was taken, given the small number of schools per country in the calibration sample.
Indices
In the remainder of the chapter, simple index variables (those based on direct recoding of responses to one or more variables) are described first, followed by complex indices (those constructed by applying the IRT scaling methodology), first for students and then for schools. Item parameter estimates are provided for each of the scaled indices.
Student Characteristics and Family Background
Student age
The variable AGE gives the age of the student expressed in number of months and was computed as the difference between the date of birth and the middle date of the country's testing window.
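A minimal sketch of this computation, assuming the testing-window midpoint is taken as the middle calendar day and the difference is counted in whole calendar months (the exact rounding convention used in PISA is not specified here):

```python
from datetime import date

def age_in_months(birth: date, window_start: date, window_end: date) -> int:
    # Middle date of the testing window: integer midpoint of the day ordinals.
    mid = date.fromordinal((window_start.toordinal() + window_end.toordinal()) // 2)
    # Difference in whole calendar months between birth and the midpoint.
    return (mid.year - birth.year) * 12 + (mid.month - birth.month)
```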
Family structure
Students were asked to indicate who usually lives at home with them. The variable FAMSTRUC is based on the first four items of this question (ST04Q01, ST04Q02, ST04Q03 and ST04Q04). It takes four values and indicates the nature of the student's family: (i) single-parent family (students who reported living with one of the following: mother, father, female guardian or male guardian); (ii) nuclear family (students who reported living with a mother and a father); (iii) mixed family (students who reported living with a mother and a male guardian, a father and a female guardian, or two guardians); and (iv) other response combinations.
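The recoding logic can be sketched as below, taking ST04Q01..ST04Q04 as booleans for mother, father, female guardian and male guardian living at home. The exact database coding may differ; this only illustrates the four-way classification described above.

```python
def famstruc(mother: bool, father: bool,
             fem_guardian: bool, male_guardian: bool) -> str:
    """Classify family structure from the four who-lives-at-home items."""
    present = sum([mother, father, fem_guardian, male_guardian])
    if present == 1:
        return "single-parent"   # exactly one parent or guardian
    if mother and father and present == 2:
        return "nuclear"         # a mother and a father only
    if present == 2 and ((mother and male_guardian) or
                         (father and fem_guardian) or
                         (fem_guardian and male_guardian)):
        return "mixed"           # parent + guardian, or two guardians
    return "other"               # any other response combination
```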
Number of siblings
Students were asked to indicate the number of siblings older and younger than themselves, or of the same age. NSIB, the total number of siblings, is derived from the three items of question ST25Q01.
Country of birth
Students were asked if they, their mother, and their father were born in the country of assessment or in another country. Responses were then grouped into: (i) students with native-born parents (students born in the country of the assessment, with at least one parent born in that country); (ii) first-generation students (students born in the country of the assessment but whose parents were both born in another country); and (iii) non-native students (students born outside the country of the assessment and whose parents were also born in another country). The variable is based on recoding ST16Q01 (student), ST16Q02 (mother) and ST16Q03 (father).
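The three-way grouping can be sketched as follows, with hypothetical boolean inputs (True = born in the country of assessment). Cases not covered by the three definitions above (e.g., a foreign-born student with a native-born parent) are folded into the last category here as a simplification.

```python
def immigration_status(student_native: bool,
                       mother_native: bool,
                       father_native: bool) -> str:
    if student_native and (mother_native or father_native):
        return "native"            # at least one parent born in country
    if student_native:
        return "first-generation"  # student native, both parents foreign-born
    return "non-native"            # student born abroad
```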
Language spoken at home
Students were asked if the language spoken at home most of the time was the language of assessment, another official national language, another national dialect or language, or another language. The responses (to ST17Q01) were then grouped into two categories: (i) the language spoken at home most of the time is different from the language of assessment, from other official national languages and from other national dialects or languages; and (ii) the language spoken at home most of the time is the language of assessment, another official national language, or another national dialect or language.
Birth order
The birth order of the assessed student, BRTHORD, is derived from the variables ST05Q01 (number of older siblings), ST05Q02 (number of younger siblings) and ST05Q03 (number of siblings of equal age). It is equal to 0 if the student is the only child, 1 if the student is the youngest child, 2 if the student is a middle child, and 3 if the student is the oldest child.
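A sketch of this derivation is below. How same-age siblings interact with the four categories is not spelled out in the text, so the fallback for a student whose only siblings are the same age is an assumption here.

```python
def brthord(n_older: int, n_younger: int, n_same_age: int) -> int:
    """Derive birth order from counts of older, younger, same-age siblings."""
    if n_older == 0 and n_younger == 0 and n_same_age == 0:
        return 0  # only child
    if n_older > 0 and n_younger == 0:
        return 1  # youngest child
    if n_older > 0 and n_younger > 0:
        return 2  # middle child
    return 3      # oldest child (no older siblings; assumption for
                  # the same-age-only case, which the text leaves open)
```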
Socio-economic status
Students were asked to report their mother's and father's occupation, and to state whether each parent was: in full-time paid work; in part-time paid work; not working but looking for a paid job; or other. The open-ended responses were then coded in accordance with the International Standard Classification of Occupations (ISCO 1988).

The PISA International Socio-Economic Index of Occupational Status (ISEI) was derived from student responses on parental occupation. The
index captures the attributes of occupations that convert parents' education into income, and was derived by the optimal scaling of occupation groups to maximise the indirect effect of education on income by occupation, and to minimise the direct effect of education on income net of occupation (both effects are net of age). For more information on the methodology, see Ganzeboom, de Graaf and Treiman (1992).
Data for mother's occupation, father's occupation and the student's expected occupation at the age of 30, obtained through questions ST08Q01, ST09Q01, ST10Q01, ST11Q01 and ST40Q01, were transformed to SEI indices using the above procedures. The variables BMMJ, BFMJ and BTHR are the mother's SEI, father's SEI and student's self-expected SEI, respectively. Two new derived variables, ISEI and HISEI, were then computed from BMMJ and BFMJ. If BFMJ is available, ISEI is equal to BFMJ, while if it is not available but BMMJ is available, then ISEI is equal to BMMJ. HISEI corresponds to the higher value of BMMJ and BFMJ.4
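The ISEI/HISEI rules just described can be sketched directly, using None for a missing SEI score (the function names are illustrative, not from the database):

```python
def derive_isei(bfmj, bmmj):
    """ISEI: father's SEI if available, otherwise mother's SEI."""
    return bfmj if bfmj is not None else bmmj

def derive_hisei(bfmj, bmmj):
    """HISEI: the higher of whichever parental SEI scores are available."""
    values = [v for v in (bfmj, bmmj) if v is not None]
    return max(values) if values else None
```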
Parental educationStudents were asked to classify their mother’s andfather’s highest level of education on the basis ofnational qualifications that were then codedaccording to the International StandardClassification of Education (ISCED, OECD, 1999b)to obtain internationally comparable categoriesof educational attainment. The resultingcategories were: did not go to school; completed<ISCED Level 1 only (primary education)>;completed <ISCED Level 2 only (lowersecondary education)>; completed <ISCED Level3B or 3C only (upper secondary education,
PIS
A 2
00
0 t
ech
nic
al r
epor
t
220
4 To capture wider aspects of a student’s family and home background in addition to occupational status, the PISA initial reportuses an index of economic, social and cultural status that is not in the international database. This index was created on the basisof the following variables: the International Socio-Economic Index of Occupational Status (ISEI); the highest level of education ofthe student’s parents converted into years of schooling; the PISA index of family wealth; the PISA index of home educationalresources; and the PISA index of possessions related to ‘classical culture’ in the family home. The ISEI represents the first principalcomponent of the factors described above. The index was constructed to have a mean of 0 and a standard deviation of 1.
Among these components, the most commonly missing data relate to the International Socio-Economic Index of OccupationalStatus (ISEI), parental education, or both. Separate factor analyses were therefore undertaken for all students with valid data for: i) socio-economic index of occupational status, index of family wealth, index of home educational resources and index ofpossessions related to ‘classical culture’ in the family home; ii) years of parental education, index of family wealth, the index ofhome educational resources and index of possessions related to ‘classical culture’ in the family home; and iii) index of familywealth, index of home educational resources and index of possessions related to ‘classical culture’ in the family home. Studentswere then assigned a factor score based on the amount of data available. For this to be done, students had to have data on at leastthree variables.5 Terms in brackets < > were replaced in the national versions of the Student and School questionnaires by the appropriate nationalequivalent. For example, <qualification at ISCED level 5A> was translated in the United States into ‘Bachelor’s Degree, post-graduate certificate program, Master’s degree program or first professional degree program’. Similarly, <classes in the language ofassessment> in Luxembourg was translated into ‘German classes’ or ‘French classes’ depending on whether students received theGerman or French version of the assessment instruments.
aimed in most countries at providing direct entry into the labour market)>; completed <ISCED Level 3A (upper secondary education, aimed in most countries at gaining entry into tertiary education)>; and completed <ISCED Level 5A, 5B or 6 (tertiary education)>.5

The educational levels of the father and the mother were collected through two questions for each parent (ST12Q01 and ST14Q01 for the mother, and ST13Q01 and ST15Q01 for the father). The variables in the database for the mother’s and father’s highest levels of education are labelled MISCED and FISCED.
Validating socio-economic status and parental education

In Canada, the Czech Republic, France and the United Kingdom, validity studies were carried out to test the reliability of student answers to the questions on their parents’ occupations. Generally these studies indicate that useful data on parental occupation can be collected from 15-year-old students. Further details of the validity studies are presented in Appendix 4.

When reviewing the validity studies it should be kept in mind that 15-year-old students’ reports of their parents’ occupations are not necessarily less accurate than adults’ reports of their own occupations. Individuals have a tendency to inflate their own occupations, especially when the data are collected by personal (including telephone) interviews. The correspondence between self-reports and proxy reports of husbands and wives is also imperfect. In addition, official records of a parent’s occupation are arguably subject to greater sources of error (job changes, changes in coding schemes, etc.) than the answers of 15-year-olds.

The important issue here is the high correlation between parental occupational status derived from student reports and from parents’ self-reports, rather than the exact correspondence between ISCO occupational categories. For example, students’ and parents’ reports of a parent’s occupation may result in its being coded to different ISCO categories but in very similar SEI scores.

In the PISA validity studies carried out in the four countries, the correlation of SEI scores between parental occupation as reported by students and by their parents was high, between 0.70 and 0.86. This correlation was judged to be high since the test/retest correlation of SEI from adults who were asked their occupation on two occasions over a short time period typically lies in this range.

The correspondence between broad occupational categories, which averaged about 70 per cent, was greater for some occupational groups (e.g., professionals) than others (e.g., managers and administrators). Only 5 per cent of cases showed a large lack of correspondence, for example, between professionals and unskilled labourers.

Finally, the strength of the relationship between SEI and achievement was found to be very similar using SEI scores derived either from the students’ or from the parents’ reports.
The desirability of coding to 4-digit ISCO codes versus 1- or 2-digit ISCO codes was also explored. SEI scores generated from 1-digit ISCO codes (that is, only nine major groups) proved to substantially change the strength of the relationship between occupational status and achievement, usually reducing it but in some instances increasing it. The results of coding to only 2-digit ISCO codes were less clear. For the relationship between father’s SEI and achievement, the correlations between SEI scores constructed from 2- or 4-digit ISCO codes were only slightly different. However, with mother’s SEI, sizeable differences did emerge for some countries, with differences of up to 0.05 in the correlation (or a 24 per cent attenuation). It is not immediately apparent why the correlation should change for mothers but not for fathers, but it is likely to be due to the greater clustering of women in particular jobs and the ability of 4-digit coding to distinguish high and low status positions within broad job groups. Since men’s occupations are more evenly spread, the precision of 4-digit coding has less of an impact.

Given the differences in the correlation between mother’s SEI and achievement, it was preferable (and safer) to code the occupations to 4-digit ISCO codes. Additionally, converting already-coded occupations from 4-digit to 2-digit codes is quite a different exercise from directly coding the occupation to two digits. More detailed occupational coding is likely to increase the coder’s familiarity with the ISCO coding schema, and therefore the coding accuracy.
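Collapsing an already-coded 4-digit occupation to fewer digits is a simple truncation of the leading digits. A sketch (the example codes are illustrative, not data from the validity studies):

```python
def truncate_isco(code4, digits):
    """Collapse a 4-digit ISCO-88 code to its leading `digits` digits
    (1 = major group, 2 = sub-major group)."""
    return code4 // 10 ** (4 - digits)

major = truncate_isco(2331, 1)      # 1-digit major group (2 = professionals)
sub_major = truncate_isco(2331, 2)  # 2-digit sub-major group (23 = teaching professionals)
```

As the text notes, this post-hoc truncation is not equivalent to coding directly at the coarser level, since the coder never engages with the detailed categories in the latter case.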
Parental Interest and Family Relations
Cultural communication
The PISA index of cultural communication was derived from students’ reports on the frequency with which their parents (or guardians) engaged with them in: discussing political or social issues; discussing books, films or television programmes; and listening to classical music. Frequency was measured on a five-point scale of never or hardly ever; a few times a year; about once a month; several times a month; and several times a week.

Scale scores are standardised Warm estimates. Positive values indicate a higher frequency of cultural communication and negative values indicate a lower frequency of cultural communication. The item parameter estimates used for the weighted likelihood estimation are given in Figure 42. For the meaning of delta and tau, see the third paragraph of this chapter.
In general, how often do your parents: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3) Tau(4)
ST19Q01 discuss political or social issues with you? -0.05 -0.59 0.43 -0.40 0.56
ST19Q02 discuss books, films or television programmes with you? -0.68 -0.52 0.26 -0.32 0.58
ST19Q03 listen to classical music with you? 0.73 0.71 0.10 -0.60 -0.22
Figure 42: Item Parameter Estimates for the Index of Cultural Communication (CULTCOM)
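The delta and tau parameters in Figure 42 define, for each item, the category probabilities of a partial credit model at a given latent trait value theta. A minimal sketch, assuming the common parameterisation in which the probability of category k is proportional to the exponential of the cumulative sum of (theta − (delta + tau_j)) over the first k steps; the report’s exact formulation may differ in detail:

```python
import numpy as np

def pcm_probs(theta, delta, taus):
    """Category probabilities under a partial credit model.
    theta: latent trait; delta: item location; taus: step parameters.
    Returns probabilities for categories 0..len(taus)."""
    steps = theta - (delta + np.asarray(taus, dtype=float))  # theta - (delta + tau_j)
    cum = np.concatenate(([0.0], np.cumsum(steps)))          # category 0 has an empty sum
    expcum = np.exp(cum - cum.max())                         # shift for numerical stability
    return expcum / expcum.sum()

# ST19Q03 ("listen to classical music with you?") using the Figure 42 estimates,
# evaluated at theta = 0 (an average student on the scale).
p = pcm_probs(theta=0.0, delta=0.73, taus=[0.71, 0.10, -0.60, -0.22])
```

The four tau parameters yield five response categories, matching the five-point frequency scale described in the text.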
Constructing and validating the questionnaire indices
Social communication

The PISA index of social communication was derived from students’ reports on the frequency with which their parents (or guardians) engaged with them in the following activities: discussing how well they are doing at school; eating <the main meal> with them around a table; and spending time simply talking with them. Frequency was measured on a five-point scale of never or hardly ever; a few times a year; about once a month; several times a month; and several times a week.

Scale scores are standardised Warm estimates, where positive values indicate higher frequency of social communication and negative values indicate lower frequency of social communication. The item parameters used for the weighted likelihood estimation are given in Figure 43.

Family educational support

The PISA index of family educational support was derived from students’ reports on how frequently the mother, father, or brothers and sisters worked with the student on what is regarded nationally as schoolwork. Students responded to each statement on a five-point scale with the following: never or hardly ever, a few times a year, about once a month, several times a month and several times a week.

Scale scores are standardised Warm estimates, where positive values indicate higher frequency of family (parents and siblings) support for the student’s schoolwork while negative values indicate lower frequency. Item parameters used for the weighted likelihood estimation are given in Figure 44.

Cultural activities

The PISA index of activities related to classical culture was derived from students’ reports on how often they had, during the preceding year: visited a museum or art gallery; attended an opera, ballet or classical symphony concert; or watched live theatre. Students responded to each statement on a four-point scale with: never or hardly ever, once or twice a year, about three or four times a year, and more than four times a year.

Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of cultural activities during the year. Item parameters used for the weighted likelihood estimation are given in Figure 45.
In general, how often do your parents: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3) Tau(4)
ST19Q04 discuss how well you are doing at school? 0.052 -0.586 -0.012 0.069 0.53
ST19Q05 eat <the main meal> with you around a table? -0.165 0.521 0.605 -0.504 -0.62
ST19Q06 spend time just talking to you? 0.113 0.011 0.143 -0.289 0.14
Figure 43: Item Parameter Estimates for the Index of Social Communication (SOCCOM)
How often do the following people work with you on your <schoolwork>? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3) Tau(4)
ST20Q01 Your mother -0.24 -0.24 -0.07 -0.40 0.71
ST20Q02 Your father 0.01 -0.24 -0.13 -0.40 0.77
ST20Q03 Your brothers and sisters 0.24 0.22 -0.23 -0.38 0.39
Figure 44: Item Parameter Estimates for the Index of Family Educational Support (FAMEDSUP)
During the past year, how often have you participated in these activities? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST18Q02 Visited a museum or art gallery. -0.48 -1.54 0.77 0.77
ST18Q04 Attended an opera, ballet or classical symphony concert. 0.76 -0.46 0.52 -0.05
ST18Q05 Watched live theatre. -0.28 -1.42 0.79 0.63
Figure 45: Item Parameter Estimates for the Index of Cultural Activities (CULTACTV)
Item dimensionality and reliability of ‘parental interest and family relations’ scales

Structural Equation Modelling confirmed the expected dimensional structure of the items used for the parental interest and family relations indices. All fit measures indicated a good model fit for the international sample (RMSEA = 0.045, AGFI = 0.97, NNFI = 0.93, and CFI = 0.95), and the estimated correlation between latent factors was highest between CULTCOM and SOCCOM, with 0.69, and lowest for CULTACTV and FAMEDSUP, with 0.17. For the country sub-samples, the RMSEA ranged between 0.032 and 0.071, showing that the dimensional structure was also confirmed across countries.

Table 74 shows the reliability for each of the parental interest and family relations variables for each participating country. Internal consistency is generally rather moderate but still satisfactory given the low number of items in each scale.
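The chapter does not spell out the internal-consistency statistic in this passage; assuming it is the conventional Cronbach’s alpha, reliabilities like those in Table 74 could be computed from an item-response matrix along these lines (the response data below are synthetic):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item separately
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Synthetic check: three identical items are perfectly consistent (alpha = 1).
rng = np.random.default_rng(1)
base = rng.normal(size=500)
alpha = cronbach_alpha(np.column_stack([base, base, base]))
```

With only three items per scale, as for CULTCOM, moderate alphas such as those reported are expected even for well-behaved items.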
Table 74: Reliability of Parental Interest and Family Relations Scales
Country CULTCOM SOCCOM FAMEDSUP CULTACTV
Australia 0.62 0.58 0.66 0.60
Austria 0.55 0.53 0.70 0.66
Belgium 0.51 0.56 0.66 0.57
Canada 0.58 0.57 0.68 0.63
Czech Republic 0.53 0.57 0.63 0.64
Denmark 0.57 0.64 0.56 0.60
Finland 0.52 0.46 0.64 0.62
France 0.49 0.53 0.58 0.55
Germany 0.55 0.50 0.64 0.64
Greece 0.44 0.62 0.70 0.52
Hungary 0.50 0.58 0.65 0.66
Iceland 0.58 0.54 0.65 0.59
Ireland 0.53 0.56 0.69 0.54
Italy 0.49 0.55 0.59 0.57
Japan 0.59 0.59 0.71 0.50
Korea 0.63 0.67 0.74 0.67
Luxembourg 0.53 0.58 0.61 0.68
Mexico 0.50 0.72 0.66 0.62
New Zealand 0.57 0.56 0.69 0.59
Norway 0.60 0.65 0.69 0.60
Poland 0.57 0.63 0.66 0.68
Portugal 0.52 0.62 0.68 0.55
Spain 0.50 0.61 0.63 0.56
Sweden 0.58 0.57 0.60 0.63
Switzerland 0.58 0.47 0.66 0.60
United Kingdom 0.51 0.60 0.66 0.63
United States 0.63 0.68 0.69 0.65
Mean OECD countries 0.55 0.58 0.66 0.63
Brazil 0.57 0.65 0.67 0.54
Latvia 0.50 0.58 0.65 0.65
Liechtenstein 0.59 0.50 0.68 0.58
Russian Federation 0.49 0.62 0.67 0.68
Netherlands 0.54 0.64 0.67 0.59
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
Family Possessions
Family wealth

The PISA index of family wealth was derived from students’ reports on: (i) the availability in their home of a dishwasher, a room of their own, educational software, and a link to the Internet; and (ii) the numbers of cellular phones, televisions, computers, motor cars and bathrooms at home.

Scale scores are standardised Warm estimates, where positive values indicate more wealth-related possessions and negative values indicate fewer wealth-related possessions. The item parameters used for the weighted likelihood estimation are given in Figure 46.

Home educational resources

The PISA index of home educational resources was derived from students’ reports on the availability and number of the following items in their home: a dictionary, a quiet place to study, a desk for study, text books and calculators.

Scale scores are standardised Warm estimates where positive values indicate possession of more educational resources and negative values indicate possession of fewer educational resources by the student’s family. The item parameters used for the weighted likelihood estimation are given in Figure 47.

Possessions related to ‘classical’ culture in the family home

The PISA index of possessions related to ‘classical’ culture in the family home was derived from students’ reports on the availability of the following items in their home: classical literature (examples were given), books of poetry and works of art (examples were given).

Scale scores are standardised Warm estimates, where positive values indicate a greater number of cultural possessions while negative values indicate fewer cultural possessions in the student’s home. The item parameters used for the weighted likelihood estimation are given in Figure 48.
In your home, do you have: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST21Q01 a dishwasher? 0.204
ST21Q02 a room of your own? -1.421
ST21Q03 educational software? 0.085
ST21Q04 a link to the Internet? 0.681
How many of these do you have at your home?
ST22Q01 <Cellular> phone 0.225 -0.657 0.137 0.52
ST22Q02 Television -1.332 -2.31 0.506 1.804
ST22Q04 Computer 1.133 -1.688 0.628 1.06
ST22Q06 Motor car 0.379 -1.873 0.262 1.611
ST22Q07 Bathroom 0.046 -3.364 0.973 2.391
Figure 46: Item Parameter Estimates for the Index of Family Wealth (WEALTH)
In your home, do you have: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST21Q05 a dictionary? -1.19
ST21Q06 a quiet place to study? 0.08
ST21Q07 a desk for study? 0.05
ST21Q08 text books? 0.27
How many of these do you have at your home?
ST22Q03 Calculator 0.79 -1.19 0.32 0.88
Figure 47: Item Parameter Estimates for the Index of Home Educational Resources (HEDRES)
Item dimensionality and reliability of ‘family possessions’ scales

Structural Equation Modelling largely confirmed the dimensional structure of the items on family possessions. For the international sample both NNFI and CFI were rather low (< 0.85), indicating a considerable lack of model fit, but AGFI (0.93) and RMSEA (0.063) showed a still acceptable fit. Estimated correlations between the latent factors were 0.40 for WEALTH and HEDRES, 0.21 for WEALTH and CULTPOSS, and 0.52 for CULTPOSS and HEDRES. The RMSEA measures for the country sub-samples ranged from 0.050 to 0.079.

Table 75 shows the reliability for each of the three family possessions variables for each participant. Whereas the reliabilities for WEALTH and CULTPOSS are reasonable, the internal consistency for HEDRES is rather low. Though the reliability of these indices might be considered lower than desirable, they were retained since they can be interpreted as simple counts of the number of possessions. The lower reliability simply suggests that these counts can be made up of very different subsets of items.
In your home, do you have: Parameter Estimates
Delta
ST21Q09 classical literature (e.g., <Shakespeare>)? 0.14
ST21Q10 books of poetry? 0.04
ST21Q11 works of art (e.g., paintings)? -0.18
Figure 48: Item Parameter Estimates for the Index of Possessions Related to ‘Classical Culture’ in the Family Home (CULTPOSS)
Table 75: Reliability of Family Possessions Scales
Country WEALTH HEDRES CULTPOSS
Australia 0.66 0.43 0.64
Austria 0.62 0.28 0.57
Belgium 0.61 0.43 0.58
Canada 0.69 0.43 0.64
Czech Republic 0.67 0.38 0.54
Denmark 0.66 0.31 0.60
Finland 0.62 0.33 0.65
France 0.61 0.30 0.62
Germany 0.65 0.31 0.59
Greece 0.70 0.30 0.48
Hungary 0.70 0.27 0.55
Iceland 0.64 0.43 0.57
Ireland 0.66 0.35 0.58
Italy 0.67 0.22 0.56
Japan 0.50 0.20 0.60
Korea 0.61 0.17 0.54
Luxembourg 0.67 0.52 0.67
Mexico 0.80 0.48 0.65
New Zealand 0.68 0.47 0.65
Norway 0.60 0.45 0.69
Poland 0.76 0.38 0.53
Portugal 0.75 0.29 0.61
Spain 0.67 0.26 0.60
Sweden 0.68 0.38 0.65
Switzerland 0.63 0.31 0.61
United Kingdom 0.65 0.46 0.66
United States 0.72 0.53 0.64
Mean OECD countries 0.70 0.36 0.59
Brazil 0.78 0.45 0.50
Latvia 0.68 0.32 0.57
Liechtenstein 0.60 0.41 0.56
Russian Federation 0.63 0.30 0.47
Netherlands 0.53 0.29 0.56
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
Instruction and Learning

Instruction time

Three indices give the time in terms of the number of minutes spent each week at school on the three assessed domains. The variables were labelled RMINS for reading courses, MMINS for mathematics courses and SMINS for science courses. They are simply the product of the corresponding item of student question 27 (ST27Q01, ST27Q03, ST27Q05) (number of class periods in <test language>, mathematics and science per week) and school question 6 (SC06Q03) (number of minutes per single class period).
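That product can be written out directly. The variable names follow the text; the response values below are invented for illustration:

```python
# Student question 27: class periods per week in each domain (illustrative values).
st27q01, st27q03, st27q05 = 4, 5, 3   # <test language>, mathematics, science
# School question 6: minutes per single class period (illustrative value).
sc06q03 = 50

rmins = st27q01 * sc06q03  # weekly minutes of reading courses     -> 200
mmins = st27q03 * sc06q03  # weekly minutes of mathematics courses -> 250
smins = st27q05 * sc06q03  # weekly minutes of science courses     -> 150
```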
Time spent on homework

The PISA index of time spent on homework was derived from students’ reports on the amount of time they devoted to homework in the <test language>, in mathematics and in science. Students responded to this on a four-point scale: no time, less than 1 hour a week, between 1 and 3 hours a week, and 3 hours or more a week.

Scale scores are standardised Warm estimates where positive values indicate more and negative values indicate less time spent on homework. The item parameters used for the weighted likelihood estimation are given in Figure 49.

Teacher support

The PISA index of teacher support was derived from students’ reports on the frequency with which the teacher: shows an interest in every student’s learning; gives students an opportunity to express opinions; helps students with their work; continues teaching until the students understand; does a lot to help students; and helps students with their learning. A four-point scale was used with response categories: never, some lessons, most lessons and every lesson.

Scale scores are standardised Warm estimates where positive values indicate higher levels and negative values indicate lower levels of teacher support. The item parameters used for the weighted likelihood estimation are given in Figure 50.
Achievement press

The PISA index of achievement press was derived from students’ reports on the frequency with which, in their <test language> lessons, the teacher wants students to work hard, tells students that they can do better, and does not like it when students deliver <careless> work, and with which students have to learn a lot. A four-point scale was used with response categories: never, some lessons, most lessons and every lesson.
On average, how much time do you spend each week on homework and study in these subject areas? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST33Q01 <test language> 0.05 -2.59 -0.04 2.62
ST33Q02 <mathematics> -0.23 -2.33 -0.01 2.34
ST33Q03 <science> 0.17 -2.04 0.02 2.02
Figure 49: Item Parameter Estimates for the Index of Time Spent on Homework (HMWKTIME)
How often do these things happen in your <test language> lessons? The teacher: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST26Q05 shows an interest in every student’s learning. 0.22 -1.78 0.36 1.42
ST26Q06 gives students an opportunity to express opinions. -0.35 -1.80 0.35 1.45
ST26Q07 helps students with their work. -0.04 -1.95 0.34 1.61
ST26Q08 continues teaching until the students understand. -0.02 -1.97 0.31 1.66
ST26Q09 does a lot to help students. -0.02 -2.15 0.30 1.86
ST26Q10 helps students with their learning. 0.20 -1.86 0.23 1.63
Figure 50: Item Parameter Estimates for the Index of Teacher Support (TEACHSUP)
Scale scores are standardised Warm estimates where positive values indicate higher levels and negative values indicate lower levels of achievement press. The item parameters used for the weighted likelihood estimation are given in Figure 51.
Item dimensionality and reliability of ‘instruction and learning’ scales

Structural Equation Modelling confirmed the expected dimensional structure of the items used for the instruction and learning indices. All fit measures indicated a close model fit for the international sample (RMSEA = 0.046, AGFI = 0.97, NNFI = 0.96 and CFI = 0.97), and the estimated correlations between latent factors were 0.27 for TEACHSUP and ACHPRESS, 0.18 for TEACHSUP and HMWKTIME, and 0.15 for ACHPRESS and HMWKTIME. The RMSEA for the country sub-samples ranged between 0.043 and 0.080, showing that the dimensional structure was similar across countries.

Table 76 shows the reliability for each of the three instruction and learning variables for each participant. Internal consistency is high for both HMWKTIME and TEACHSUP but only moderate for ACHPRESS.
How often do these things happen in your <test language> lessons? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST26Q02 The teacher wants students to work hard. -0.37 -1.33 0.36 0.97
ST26Q03 The teacher tells students that they can do better. 0.22 -1.63 0.65 0.98
ST26Q04 The teacher does not like it when students deliver <careless> work. 0.19 -1.36 0.62 0.74
ST26Q15 Students have to learn a lot. -0.04 -1.76 0.42 1.34
Figure 51: Item Parameter Estimates for the Index of Achievement Press (ACHPRESS)
Table 76: Reliability of Instruction and Learning Scales
Country HMWKTIME TEACHSUP ACHPRESS
Australia 0.81 0.90 0.53
Austria 0.57 0.87 0.59
Belgium 0.77 0.85 0.57
Canada 0.78 0.90 0.55
Czech Republic 0.72 0.80 0.67
Denmark 0.71 0.86 0.37
Finland 0.79 0.88 0.62
France 0.74 0.87 0.62
Germany 0.66 0.86 0.57
Greece 0.84 0.86 0.50
Hungary 0.69 0.85 0.54
Iceland 0.72 0.87 0.62
Ireland 0.78 0.89 0.50
Italy 0.70 0.85 0.51
Japan 0.86 0.89 0.47
School and Classroom Climate
Disciplinary climate

The PISA index of disciplinary climate summarises students’ reports on the frequency with which, in their <test language> lessons: the teacher has to wait a long time for students to <quieten down>; students cannot work well; students don’t listen to what the teacher says; students don’t start working for a long time after the lesson begins; there is noise and disorder; and, at the start of class, more than five minutes are spent doing nothing. A four-point scale was used with response categories: never, some lessons, most lessons and every lesson.

Scale scores are standardised Warm estimates where positive values indicate a more positive perception of the disciplinary climate and negative values indicate a less positive perception of the disciplinary climate. The item parameters used for the weighted likelihood estimation are given in Figure 52.
How often do these things happen in your <test language> lessons? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST26Q01 The teacher has to wait a long time for students to <quieten down>. -0.26 -2.48 0.95 1.53
ST26Q12 Students cannot work well. 0.47 -2.52 1.02 1.50
ST26Q13 Students don’t listen to what the teacher says. 0.11 -2.79 1.08 1.71
ST26Q14 Students don’t start working for a long time after the lesson begins. 0.21 -2.23 0.76 1.47
ST26Q16 There is noise and disorder. -0.10 -2.09 0.93 1.16
ST26Q17 At the start of class, more than five minutes are spent doing nothing. -0.43 -1.56 0.72 0.84
Figure 52: Item Parameter Estimates for the Index of Disciplinary Climate (DISCLIM)
Table 76 (cont.)
Country HMWKTIME TEACHSUP ACHPRESS
Korea 0.88 0.83 0.38
Luxembourg 0.67 0.88 0.62
Mexico 0.74 0.83 0.49
New Zealand 0.80 0.89 0.51
Norway 0.82 0.88 0.60
Poland 0.79 0.86 0.61
Portugal 0.76 0.88 0.52
Spain 0.79 0.90 0.53
Sweden 0.74 0.88 0.55
Switzerland 0.68 0.86 0.55
United Kingdom 0.79 0.89 0.44
United States 0.77 0.91 0.54
Mean OECD countries 0.76 0.87 0.54
Brazil 0.80 0.86 0.47
Latvia 0.73 0.83 0.64
Liechtenstein 0.76 0.84 0.55
Russian Federation 0.80 0.84 0.53
Netherlands 0.69 0.82 0.49
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
Teacher-student relations

The PISA index of teacher-student relations was derived from students’ reports on their level of agreement with the following statements: students get along well with most teachers; most teachers are interested in students’ well-being; most of my teachers really listen to what I have to say; if I need extra help, I will receive it from my teachers; and most of my teachers treat me fairly. A four-point scale was used with response categories: strongly disagree, disagree, agree and strongly agree.

Scale scores are standardised Warm estimates where positive values indicate more positive perceptions of student−teacher relations and negative values indicate less positive perceptions of student−teacher relations. The item parameters used for the weighted likelihood estimation are given in Figure 53.

Sense of belonging

The PISA index of sense of belonging was derived from students’ reports on whether their school is a place where they: feel like an outsider, make friends easily, feel like they belong, feel awkward and out of place, other students seem to like them, or feel lonely. A four-point scale was used with response categories: strongly disagree, disagree, agree and strongly agree.

Scale scores are standardised Warm estimates where positive values indicate more positive attitudes towards school and negative values indicate less positive attitudes towards school. The item parameters used for the weighted likelihood estimation are given in Figure 54.
Item dimensionality and reliability of ‘school and classroom climate’ scales

For the international sample, the Confirmatory Factor Analysis showed a good model fit with RMSEA = 0.053, AGFI = 0.95, NNFI = 0.92 and CFI = 0.94. The correlation between latent factors is -0.23 for STUDREL and DISCLIM, -0.09 for BELONG and DISCLIM, and 0.29 for BELONG and STUDREL. For the country sub-samples, the RMSEA ranged between 0.050 and 0.073, showing that the dimensionality structure is similar across countries.

Table 77 shows the reliability for each of the three school and classroom climate variables for each participating country. Internal consistency is very high for all three indices.
How much do you disagree or agree with each of the following statements about teachers at your school? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST30Q01 Students get along well with most teachers. 0.22 -2.59 -0.76 3.35
ST30Q02 Most teachers are interested in students’ well-being. 0.10 -2.48 -0.70 3.18
ST30Q03 Most of my teachers really listen to what I have to say. 0.14 -2.57 -0.53 3.09
ST30Q04 If I need extra help, I will receive it from my teachers. -0.26 -2.20 -0.81 3.01
ST30Q05 Most of my teachers treat me fairly. -0.20 -2.04 -0.96 3.00
Figure 53: Item Parameter Estimates for the Index of Teacher-Student Relations (STUDREL)
My school is a place where: Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST31Q01 I feel like an outsider (or left out of things). (rev.) -0.57 -0.94 -0.68 1.62
ST31Q02 I make friends easily. 0.03 -1.83 -0.94 2.77
ST31Q03 I feel like I belong. 0.65 -1.49 -0.92 2.41
ST31Q04 I feel awkward and out of place. (rev.) -0.18 -1.34 -0.56 1.89
ST31Q05 other students seem to like me. 0.59 -1.98 -1.27 3.25
ST31Q06 I feel lonely. (rev.) -0.52 -1.18 -0.64 1.81
Note: Items marked ‘rev.’ had their response categories reversed before scaling.
Figure 54: Item Parameter Estimates for the Index of Sense of Belonging (BELONG)
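The reversal noted under Figure 54 is a simple recoding of the response categories so that negatively phrased items (such as "I feel like an outsider") point the same way as the rest of the scale. A sketch, assuming the four categories are coded 1 to 4:

```python
def reverse_categories(response, n_categories=4):
    """Reverse a 1..n_categories response code, e.g. on the four-point
    strongly disagree..strongly agree scale: 1 <-> 4 and 2 <-> 3."""
    return n_categories + 1 - response

# Applying the recode to the full range of the four-point scale.
reversed_item = [reverse_categories(r) for r in [1, 2, 3, 4]]  # -> [4, 3, 2, 1]
```

After this recode, a higher code consistently indicates a more positive attitude, which is what the scaling model requires.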
Table 77: Reliability of School and Classroom Climate Scales

Country DISCLIM STUDREL BELONG
Australia 0.84 0.83 0.83
Austria 0.81 0.80 0.80
Belgium 0.82 0.78 0.73
Canada 0.83 0.80 0.84
Czech Republic 0.81 0.77 0.69
Denmark 0.80 0.82 0.75
Finland 0.84 0.78 0.83
France 0.82 0.78 0.75
Germany 0.80 0.76 0.79
Greece 0.71 0.71 0.73
Hungary 0.84 0.77 0.76
Iceland 0.83 0.81 0.83
Ireland 0.86 0.77 0.84
Italy 0.82 0.75 0.72
Japan 0.80 0.85 0.78
Korea 0.81 0.82 0.69
Luxembourg 0.77 0.80 0.77
Mexico 0.70 0.70 0.71
New Zealand 0.85 0.78 0.83
Norway 0.83 0.82 0.82
Poland 0.81 0.78 0.68
Portugal 0.77 0.77 0.69
Spain 0.80 0.80 0.71
Sweden 0.82 0.81 0.81
Switzerland 0.77 0.82 0.75
United Kingdom 0.87 0.82 0.83
United States 0.83 0.83 0.86
Mean OECD countries 0.81 0.79 0.77
Brazil 0.76 0.76 0.77
Latvia 0.79 0.72 0.71
Liechtenstein 0.76 0.82 0.81
Russian Federation 0.80 0.70 0.71
Netherlands 0.83 0.73 0.76

Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.

Reading Habits

Engagement in reading

The PISA index of engagement in reading was derived from students’ level of agreement with the following statements: I read only if I have to; reading is one of my favourite hobbies; I like talking about books with other people; I find it hard to finish books; I feel happy if I receive a book as a present; for me, reading is a waste of time; I enjoy going to a bookstore or a library; I read only to get information that I need; and I cannot sit still and read for more than a few minutes. A four-point scale was used with response categories: strongly disagree, disagree, agree and strongly agree.
Scale scores are standardised Warm estimates where positive values indicate more positive attitudes towards reading and negative values indicate less positive attitudes towards reading. The item parameters used for the weighted likelihood estimation are given in Figure 55.
Reading diversity

The PISA index of reading diversity was derived from dichotomised student reports on how often (never to a few times a year versus about once a month or more often) they read: magazines, comic books, fiction, non-fiction books, email and Web pages, and newspapers. It is interpreted as an indicator of the variety of student reading sources.
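The dichotomisation described above collapses the five-point frequency scale into two codes. A sketch, assuming the categories are coded 1 to 5 as in the student questionnaire:

```python
def dichotomise(response):
    """Collapse the five-point frequency scale for the DIVREAD items:
    0 = never or a few times a year (categories 1-2),
    1 = about once a month or more often (categories 3-5)."""
    return 1 if response >= 3 else 0

# Applying the recode to the full range of the five-point scale.
codes = [dichotomise(r) for r in [1, 2, 3, 4, 5]]  # -> [0, 0, 1, 1, 1]
```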
Scale scores are standardised Warm estimateswhere positive values indicate higher levels andnegative values lower levels of reading diversity.
The item parameters used for the weightedlikelihood estimation are given in Figure 56.
Item dimensionality and reliability of ‘reading’ scales

A CFA showed a still acceptable fit of the two-dimensional model for the international sample (RMSEA = 0.078, AGFI = 0.91, NNFI = 0.89 and CFI = 0.91), with a correlation between latent factors of 0.66. The RMSEA ranged between 0.046 and 0.127 across OECD countries. The lack of fit is largely due to the common variance between negatively phrased, and therefore inverted, items, which is not explained in the model.

Table 78 shows the reliability for each of the two reading habits variables for each participating country. Reliability is generally rather low for DIVREAD, but as this is a simple index measuring only the variation in reading material, the low internal consistency was not considered a major concern.
How much do you disagree or agree with these statements about reading? Parameter Estimates
Delta Tau(1) Tau(2) Tau(3)
ST35Q01 I read only if I have to. (rev.) -0.42 -1.24 -0.18 1.42
ST35Q02 Reading is one of my favourite hobbies. 0.83 -1.61 -0.01 1.62
ST35Q03 I like talking about books with other people. 1.02 -1.82 -0.28 2.09
ST35Q04 I find it hard to finish books. (rev.) -0.55 -1.45 -0.23 1.68
ST35Q05 I feel happy if I receive a book as a present. 0.54 -1.59 -0.52 2.11
ST35Q06 For me, reading is a waste of time. (rev.) -0.91 -0.87 -0.67 1.54
ST35Q07 I enjoy going to a bookstore or a library. 0.45 -1.57 -0.36 1.92
ST35Q08 I read only to get information that I need. (rev.) -0.05 -1.85 -0.05 1.90
ST35Q09 I cannot sit still and read for more than a few minutes. (rev.) -0.91 -0.96 -0.46 1.42
Note: Items marked ‘rev.’ had their response categories reversed before scaling.
Figure 55: Item Parameter Estimates for the Index of Engagement in Reading (JOYREAD)
How often do you read these materials because you want to? Parameter Estimates
Delta
ST36Q01 Magazines -1.52
ST36Q02 Comic books 0.75
ST36Q03 Fiction (novels, narratives, stories) 0.60
ST36Q04 Non-fiction books 1.04
ST36Q05 Emails and Web pages 0.16
ST36Q06 Newspapers -1.03
Note: Items were dichotomised as follows: 0 = 1, 2 (never or a few times a year); 1 = 3, 4, 5 (about once a month, several times a month or several times a week).
Figure 56: Item Parameter Estimates for the Index of Reading Diversity (DIVREAD)
stru
ctin
g an
d v
alid
atin
g th
e q
ues
tion
aire
in
dic
es
231
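The dichotomisation described in the note to Figure 56 can be sketched as follows; the function name is ours:

```python
def dichotomise(response: int) -> int:
    """Recode a 1-5 frequency response for the DIVREAD items:
    categories 1-2 (never / a few times a year)  -> 0,
    categories 3-5 (about once a month or more)  -> 1."""
    return 0 if response <= 2 else 1
```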
Table 78: Reliability of Reading Habits Scales
Country JOYREAD DIVREAD
Australia 0.76 0.51
Austria 0.76 0.34
Belgium 0.74 0.55
Canada 0.77 0.50
Czech Republic 0.72 0.38
Denmark 0.75 0.58
Finland 0.75 0.43
France 0.71 0.54
Germany 0.76 0.40
Greece 0.66 0.48
Hungary 0.70 0.53
Iceland 0.72 0.48
Ireland 0.76 0.48
Italy 0.71 0.48
Japan 0.66 0.44
Korea 0.68 0.59
Luxembourg 0.71 0.53
Mexico 0.61 0.61
New Zealand 0.73 0.49
Norway 0.75 0.44
Poland 0.69 0.50
Portugal 0.70 0.50
Spain 0.70 0.53
Sweden 0.75 0.46
Switzerland 0.76 0.46
United Kingdom 0.75 0.50
United States 0.76 0.61
Mean OECD countries 0.72 0.50
Brazil 0.70 0.59
Latvia 0.64 0.44
Liechtenstein 0.73 0.50
Russian Federation 0.64 0.51
Netherlands 0.74 0.55
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
Motivation and Interest
The following indices were derived from the international option on Self-Regulated Learning, which about a quarter of the countries chose not to administer.
Instrumental motivation

The PISA index of instrumental motivation was derived from students’ reports on how often they study to: increase their job opportunities; ensure that their future will be financially secure; and enable them to get a good job. A four-point scale was used with response categories almost never, sometimes, often and almost always.
Scale scores are standardised Warm estimates where positive values indicate higher levels and negative values indicate lower levels of endorsement of instrumental motivation for learning. The item parameters used for the weighted likelihood estimation are given in Figure 57.
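The standardisation applied to the Warm (weighted likelihood) estimates for these indices can be sketched as below. This is illustrative only: operationally, PISA standardised over the OECD student population using sampling weights, which this toy helper omits.

```python
def standardise(scores):
    """Rescale a list of Warm estimates to mean 0 and standard
    deviation 1, so that positive values indicate above-average
    endorsement and negative values below-average endorsement."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]
```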
Interest in reading

The PISA index of interest in reading was derived from student agreement with the three statements listed in Figure 58. A four-point scale was used with response categories disagree, disagree somewhat, agree somewhat and agree. For information on the conceptual underpinning of the index, see Baumert, Gruehn, Heyn, Köller and Schnabel (1997).
Scale scores are standardised Warm estimates where positive values indicate higher levels and negative values indicate lower levels of interest in reading. The item parameters used for the weighted likelihood estimation are given in Figure 58.
How often do these things apply to you?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q06 I study to increase my job opportunities. 0.04 -2.20 -0.01 2.20
CC01Q14 I study to ensure that my future will be financially secure. 0.26 -2.07 0.02 2.05
CC01Q22 I study to get a good job. -0.30 -2.12 0.01 2.11
Figure 57: Item Parameter Estimates for the Index of Instrumental Motivation (INSMOT)
Interest in mathematics

The PISA index of interest in mathematics was derived from students’ responses to the three items listed in Figure 59. A four-point scale with the response categories disagree, disagree somewhat, agree somewhat and agree was used. For information on the conceptual underpinning of the index, see Baumert, Gruehn, Heyn, Köller and Schnabel (1997).
Scale scores are standardised Warm estimates where positive values indicate a greater interest in mathematics and negative values a lower interest in mathematics. The item parameters used for the weighted likelihood estimation are given in Figure 59.
Item dimensionality and reliability of ‘motivation and interest’ scales
In a CFA, the fit for the three-dimensional model was acceptable (RMSEA = 0.040, AGFI = 0.98, NNFI = 0.98 and CFI = 0.99), and the correlation between the three latent dimensions was 0.13 for INTREA and INSMOT, 0.27 for INTMAT and INSMOT, and 0.11 for INTMAT and INTREA. Across countries, the RMSEA ranged between 0.014 and 0.074, which confirmed the dimensional structure for all countries.
Table 79 shows the reliability of each of the three motivation and interest variables for each participating country. For all three scales, the internal consistency is satisfactory across countries.
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q06 Because reading is fun, I wouldn’t want to give it up. 0.18 -1.55 -0.06 1.61
CC02Q13 I read in my spare time. 0.17 -1.45 -0.19 1.64
CC02Q17 When I read, I sometimes get totally absorbed. -0.35 -1.40 -0.15 1.55
Figure 58: Item Parameter Estimates for the Index of Interest in Reading (INTREA)
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q01 When I do mathematics, I sometimes get totally absorbed. -0.02 -1.55 -0.27 1.83
CC02Q10 Because doing mathematics is fun, I wouldn’t want to give it up. 0.27 -1.30 -0.13 1.42
CC02Q21 Mathematics is important to me personally. -0.25 -1.20 -0.28 1.48
Figure 59: Item Parameter Estimates for the Index of Interest in Mathematics (INTMAT)
Table 79: Reliability of Motivation and Interest Scales
Country INSMOT INTREA INTMAT
Australia 0.83 0.83 0.72
Austria 0.81 0.84 0.72
Belgium (Fl.) 0.84 0.87 0.78
Czech Republic 0.81 0.86 0.68
Denmark 0.79 0.85 0.83
Finland 0.82 0.88 0.82
Germany 0.83 0.85 0.76
Hungary 0.85 0.87 0.75
Iceland 0.81 0.78 0.66
Ireland 0.85 0.84 0.71
Italy 0.84 0.82 0.72
Korea 0.72 0.82 0.83
Luxembourg 0.82 0.81 0.72
Mexico 0.69 0.61 0.51
New Zealand 0.85 0.83 0.75
Norway 0.83 0.85 0.79
Portugal 0.84 0.81 0.73
Sweden 0.85 0.75 0.77
Switzerland 0.78 0.84 0.77
United States 0.83 0.82 0.74
Mean OECD countries 0.82 0.83 0.75
Brazil 0.83 0.73 0.65
Latvia 0.69 0.75 0.72
Liechtenstein 0.83 0.84 0.73
Russian Federation 0.76 0.78 0.74
Netherlands 0.82 0.84 0.76
Scotland 0.79 0.86 0.72
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Learning Strategies
Control strategies

The PISA index of control strategies was derived from students’ responses to the five items listed in Figure 60. A four-point scale was used with the response categories almost never, sometimes, often and almost always. For information on the conceptual underpinning of the index, see Baumert, Heyn and Köller (1994).
Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of self-reported use of control strategies. The item parameters used for the weighted likelihood estimation are given in Figure 60.
Memorisation

The PISA index of memorisation strategies was derived from students’ reports on the four items in Figure 61. A four-point scale was used with the response categories almost never, sometimes, often and almost always. For information on the conceptual underpinning of the index, see Baumert, Heyn and Köller (1994) and Pintrich et al. (1993).
Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of self-reported use of memorisation as a learning strategy. Item parameters used for the weighted likelihood estimation are given in Figure 61.
Elaboration

The PISA index of elaboration strategies was derived from students’ reports on the four items in Figure 62. A four-point scale with the response categories almost never, sometimes, often and almost always was used. For information on the conceptual underpinning of the index, see Baumert, Heyn and Köller (1994).
Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of self-reported use of elaboration as a learning strategy. The item parameters used for the weighted likelihood estimation are given in Figure 62.
How often do these things apply to you? When I study,
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q03 I start by figuring out exactly what I need to learn. -0.13 -1.73 0.10 1.63
CC01Q13 I force myself to check to see if I remember what I have learned. 0.23 -1.75 0.23 1.52
CC01Q19 I try to figure out which concepts I still haven’t really understood. 0.02 -2.43 0.22 2.21
CC01Q23 I make sure that I remember the most important things. -0.54 -2.09 0.07 2.01
CC01Q27 and I don’t understand something, I look for additional information to clarify this. 0.43 -1.99 0.24 1.74
Figure 60: Item Parameter Estimates for the Index of Control Strategies (CSTRAT)
How often do these things apply to you? When I study,
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q01 I try to memorise everything that might be covered. -0.26 -1.81 0.30 1.51
CC01Q05 I memorise as much as possible. -0.10 -1.63 0.17 1.46
CC01Q10 I memorise all new material so that I can recite it. 0.62 -1.70 0.26 1.44
CC01Q15 I practise by saying the material to myself over and over. -0.26 -1.57 0.22 1.35
Figure 61: Item Parameter Estimates for the Index of Memorisation (MEMOR)
How often do these things apply to you? When I study,
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q09 I try to relate new material to things I have learned in other subjects. 0.26 -2.39 0.20 2.20
CC01Q17 I figure out how the information might be useful in the real world. 0.32 -2.24 0.23 2.01
CC01Q21 I try to understand the material better by relating it to things I already know. -0.40 -2.58 0.24 2.35
CC01Q25 I figure out how the material fits in with what I have already learned. -0.18 -2.77 0.25 2.52
Figure 62: Item Parameter Estimates for the Index of Elaboration (ELAB)
Effort and perseverance

The PISA index of effort and perseverance was derived from students’ reports on the four items in Figure 63. A four-point scale with the response categories almost never, sometimes, often and almost always was used.
Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of effort and perseverance as a learning strategy. The item parameters used for the weighted likelihood estimation are given in Figure 63.
Item dimensionality and reliability of ‘learning strategies’ scales
Table 80 shows the reliability of each of the four learning strategies variables for each participating country. For all four scales, the internal consistency is satisfactory across countries.
How often do these things apply to you? When I study,
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q07 I work as hard as possible. 0.09 -2.58 0.31 2.27
CC01Q12 I keep working even if the material is difficult. 0.33 -2.54 0.22 2.32
CC01Q20 I try to do my best to acquire the knowledge and skills taught. -0.19 -2.82 0.15 2.68
CC01Q28 I put forth my best effort. -0.23 -2.59 0.30 2.29
Figure 63: Item Parameter Estimates for the Index of Effort and Perseverance (EFFPER)
Table 80: Reliability of Learning Strategies Scales
Country CSTRAT MEMOR ELAB EFFPER
Australia 0.81 0.74 0.79 0.81
Austria 0.67 0.71 0.73 0.76
Belgium (Fl.) 0.69 0.76 0.77 0.78
Czech Republic 0.75 0.74 0.79 0.75
Denmark 0.70 0.63 0.75 0.76
Finland 0.77 0.71 0.79 0.78
Germany 0.73 0.73 0.75 0.77
Hungary 0.72 0.71 0.78 0.76
Iceland 0.72 0.70 0.76 0.77
Ireland 0.77 0.75 0.78 0.82
Italy 0.74 0.66 0.80 0.78
Korea 0.72 0.69 0.76 0.78
Luxembourg 0.77 0.77 0.76 0.78
Mexico 0.74 0.75 0.77 0.74
New Zealand 0.80 0.73 0.76 0.80
Norway 0.75 0.77 0.80 0.79
Portugal 0.77 0.73 0.74 0.78
Sweden 0.76 0.74 0.79 0.80
Switzerland 0.74 0.70 0.74 0.79
United States 0.83 0.77 0.80 0.83
Mean OECD countries 0.76 0.72 0.77 0.78
Brazil 0.77 0.67 0.73 0.77
Latvia 0.66 0.54 0.63 0.70
Learning Style
Co-operative learning

The PISA index of co-operative learning was derived from student reports on the four items in Figure 64. A four-point scale with the response categories disagree, disagree somewhat, agree somewhat and agree was used. For information on the conceptual underpinning of the index, see Owens and Barnes (1992).
Scale scores are standardised Warm estimates where positive values indicate higher levels of self-perception of preference for co-operative learning and negative values lower levels of self-perception of this preference. The item parameters used for the weighted likelihood estimation are given in Figure 64.
Competitive learning
The PISA index of competitive learning was derived from students’ responses to the items in Figure 65, which also gives the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories disagree, disagree somewhat, agree somewhat and agree was used. For information on the conceptual underpinning of the index, see Owens and Barnes (1992).
Scale scores are standardised Warm estimates where positive values indicate higher reported levels of self-perception of preference for competitive learning and negative values indicate lower levels of self-perception of this preference.
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q04 I like to try to be better than other students. 0.19 -1.82 -0.09 1.91
CC02Q11 Trying to be better than others makes me work well. 0.28 -1.82 -0.29 2.10
CC02Q16 I would like to be the best at something. -0.92 -1.10 -0.20 1.30
CC02Q24 I learn faster if I’m trying to do better than the others. 0.45 -1.99 -0.01 2.00
Figure 65: Item Parameter Estimates for the Index of Competitive Learning (COMLRN)
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q02 I like to work with other students. -0.21 -0.85 -0.59 1.44
CC02Q08 I learn most when I work with other students. 0.74 -1.52 -0.11 1.63
CC02Q19 I like to help other people do well in a group. -0.04 -1.29 -0.60 1.89
CC02Q22 It is helpful to put together everyone’s ideas when working on a project. -0.50 -1.03 -0.59 1.62
Figure 64: Item Parameter Estimates for the Index of Co-operative Learning (COPLRN)
Table 80 (cont.)
Country CSTRAT MEMOR ELAB EFFPER
Liechtenstein 0.78 0.72 0.80 0.81
Russian Federation 0.71 0.56 0.77 0.76
Netherlands 0.70 0.65 0.75 0.77
Scotland 0.75 0.68 0.73 0.76
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Item dimensionality and reliability of ‘co-operative and competitive learning’ scales
A CFA showed a poor fit for the initial model, which is largely due to a similar wording of items CC02Q02 and CC02Q08. After introducing a correlated error term for these items, the model fit was satisfactory for the international sample (RMSEA = 0.048, AGFI = 0.98, NNFI = 0.97 and CFI = 0.98). The RMSEA ranged between 0.029 and 0.095 across countries, which confirmed the two-dimensional structure for the sub-samples.
Table 81 shows the reliability of the two learning style variables for each participating country. The internal consistency for COPLRN is clearly lower than for COMLRN.
Table 81: Reliability of Learning Style Scales
Country COPLRN COMLRN
Australia 0.64 0.77
Austria 0.61 0.78
Belgium (Fl.) 0.59 0.77
Czech Republic 0.64 0.74
Denmark 0.60 0.81
Finland 0.67 0.80
Germany 0.68 0.75
Hungary 0.57 0.72
Iceland 0.64 0.79
Ireland 0.66 0.80
Italy 0.67 0.77
Korea 0.60 0.76
Luxembourg 0.73 0.75
Mexico 0.62 0.71
New Zealand 0.68 0.79
Norway 0.70 0.81
Portugal 0.64 0.79
Sweden 0.62 0.80
Switzerland 0.63 0.77
United States 0.77 0.78
Mean OECD countries 0.68 0.78
Brazil 0.59 0.73
Latvia 0.65 0.71
Liechtenstein 0.64 0.74
Russian Federation 0.66 0.72
Netherlands 0.60 0.81
Scotland 0.61 0.80
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Self-concept
Self-concept in reading

The PISA index of self-concept in reading was derived from students’ reports on the three items in Figure 66. A four-point scale was used with response categories disagree, disagree somewhat, agree somewhat and agree. For information on the conceptual underpinning of the index, see Marsh, Shavelson and Byrne (1992).
Scale scores are standardised Warm estimates where positive values indicate a higher level of self-concept in reading and negative values a lower level of self-concept in reading. The item parameters used for the weighted likelihood estimation are given in Figure 66.
Student self-concept in mathematics

The PISA index of self-concept in mathematics was derived from students’ responses to the items in Figure 67. A four-point scale with the response categories disagree, disagree somewhat, agree somewhat and agree was used. For information on the conceptual underpinning of the index, see Marsh, Shavelson and Byrne (1992).
Scale scores are standardised Warm estimates where positive values indicate a greater self-concept in mathematics and negative values indicate a lower self-concept in mathematics. The item parameters used for the weighted likelihood estimation are given in Figure 67.
Student academic self-concept

The PISA index of academic self-concept was derived from student responses to the items in Figure 68, which gives the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories disagree, disagree somewhat, agree somewhat and agree was used. For information on the conceptual underpinning of the index, see Marsh, Shavelson and Byrne (1992).
Scale scores are standardised Warm estimates where positive values indicate higher levels of academic self-concept and negative values lower levels of academic self-concept.
Item dimensionality and reliability of ‘self-concept’ scales
The dimensional structure of the items was confirmed by a CFA. The fit was acceptable for the international sample (RMSEA = 0.073, AGFI = 0.95, NNFI = 0.95 and CFI = 0.97). The correlation between latent factors was 0.09 between MATCON and SCVERB, 0.56 between MATCON and SCACAD, and 0.61 between SCVERB and SCACAD. The RMSEA ranged between 0.039 and 0.112 across countries, showing a variation in model fit. There was, however, only one country with a clearly poor fit for this model.
Table 82 shows the reliabilities of these indices for each country, which are satisfactory in almost all participating countries; the internal consistency of SCVERB is below 0.60 in only one country.
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q05 I’m hopeless in <test language> classes. (rev.) -0.52 -1.79 0.25 1.54
CC02Q09 I learn things quickly in <test language> class. 0.37 -2.21 -0.28 2.49
CC02Q23 I get good marks in <test language>. 0.15 -2.04 -0.42 2.45
Note: Items marked ‘rev.’ had their response categories reversed before scaling.
Figure 66: Item Parameter Estimates for the Index of Self-Concept in Reading (SCVERB)
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q12 I get good marks in mathematics. -0.48 -2.60 -0.36 2.96
CC02Q15 Mathematics is one of my best subjects. 0.36 -2.20 0.01 2.19
CC02Q18 I have always done well in mathematics. 0.12 -2.63 0.00 2.63
Figure 67: Item Parameter Estimates for the Index of Self-Concept in Mathematics (MATCON)
How much do you disagree or agree with each of the following?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC02Q03 I learn things quickly in most school subjects. -0.13 -2.79 -0.41 3.20
CC02Q07 I’m good at most school subjects. -0.02 -2.54 -0.39 2.93
CC02Q20 I do well in tests in most school subjects. 0.14 -2.59 -0.34 2.94
Figure 68: Item Parameter Estimates for the Index of Academic Self-Concept (SCACAD)
Learning Confidence
Perceived self-efficacy

The PISA index of perceived self-efficacy was derived from students’ responses to the three items in Figure 69, which gives the item parameters used for the weighted likelihood estimation. A four-point scale was used with response categories almost never, sometimes, often and almost always.
Scale scores are standardised Warm estimates where positive values indicate a higher sense of perceived self-efficacy and negative values a lower sense of perceived self-efficacy.
Table 82: Reliability of Self-Concept Scales
Country SCVERB MATCON SCACAD
Australia 0.76 0.85 0.74
Austria 0.81 0.88 0.76
Belgium (Fl.) 0.72 0.85 0.70
Czech Republic 0.74 0.84 0.76
Denmark 0.78 0.87 0.80
Finland 0.80 0.93 0.84
Germany 0.82 0.90 0.78
Hungary 0.66 0.87 0.73
Iceland 0.77 0.90 0.81
Ireland 0.79 0.87 0.77
Italy 0.82 0.87 0.74
Korea 0.68 0.89 0.78
Luxembourg 0.73 0.87 0.74
Mexico 0.52 0.84 0.71
New Zealand 0.81 0.88 0.79
Norway 0.74 0.90 0.85
Portugal 0.74 0.87 0.73
Sweden 0.75 0.88 0.81
Switzerland 0.75 0.88 0.74
United States 0.76 0.86 0.79
Mean OECD countries 0.75 0.88 0.79
Brazil 0.60 0.85 0.73
Latvia 0.67 0.85 0.66
Liechtenstein 0.73 0.84 0.77
Russian Federation 0.67 0.87 0.72
Netherlands 0.73 0.89 0.76
Scotland 0.81 0.87 0.74
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Control expectation

The PISA index of control expectation was derived from students’ responses to the four items in Figure 70, which gives the item parameters used for the weighted likelihood estimation. A four-point scale was used with response categories almost never, sometimes, often and almost always.
Scale scores are standardised Warm estimates where positive values indicate higher control expectation and negative values indicate lower control expectation.
Item dimensionality and reliability of ‘learning confidence’ scales
The CFA for the two-dimensional model showed an acceptable model fit for the international sample (RMSEA = 0.069, AGFI = 0.96, NNFI = 0.95 and CFI = 0.97). The correlation between the latent factors was as high as 0.93, but for conceptual reasons it was decided to keep two separate scales. The RMSEA for the country sub-samples ranged between 0.046 and 0.111.
Table 83 shows the reliability of each of the two learning confidence scales for each participating country. Internal consistency is good for both indices across countries.
How often do these things apply to you?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q02 I’m certain I can understand the most difficult material presented in texts. 0.42 -2.77 0.45 2.32
CC01Q18 I’m confident I can do an excellent job on assignments and tests. -0.26 -2.66 0.29 2.38
CC01Q26 I’m certain I can master the skills being taught. -0.16 -2.86 0.26 2.60
Figure 69: Item Parameter Estimates for the Index of Perceived Self-Efficacy (SELFEF)
How often do these things apply to you?
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
CC01Q04 When I sit myself down to learn something really difficult, I can learn it. -0.13 -2.25 0.34 1.90
CC01Q11 If I decide not to get any bad grades, I can really do it. -0.09 -2.10 0.30 1.81
CC01Q16 If I decide not to get any problems wrong, I can really do it. 0.80 -2.44 0.35 2.09
CC01Q24 If I want to learn something well, I can. -0.58 -2.49 0.29 2.20
Figure 70: Item Parameter Estimates for the Index of Control Expectation (CEXP)
Table 83: Reliability of Learning Confidence Scales
Country SELFEF CEXP
Australia 0.73 0.80
Austria 0.63 0.70
Belgium (Fl.) 0.64 0.72
Czech Republic 0.66 0.73
Denmark 0.72 0.75
Finland 0.77 0.81
Germany 0.67 0.71
Hungary 0.71 0.72
Iceland 0.79 0.78
Ireland 0.74 0.78
Italy 0.67 0.71
Korea 0.68 0.75
Luxembourg 0.67 0.74
Mexico 0.67 0.76
New Zealand 0.70 0.79
Norway 0.75 0.82
Portugal 0.69 0.75
Sweden 0.75 0.79
Switzerland 0.61 0.72
United States 0.77 0.79
Mean OECD countries 0.70 0.75
Brazil 0.66 0.72
Latvia 0.57 0.68
Liechtenstein 0.69 0.74
Russian Federation 0.67 0.70
Netherlands 0.63 0.70
Scotland 0.71 0.74
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Computer Familiarity or Information Technology
The following indices were derived from the international option on Computer Familiarity, which was administered in almost two-thirds of the countries.
Comfort with and perceived ability to use computers

The PISA index of comfort with and perceived ability to use computers was derived from students’ responses to the following questions: How comfortable are you using a computer? How comfortable are you using a computer to write a paper? How comfortable are you taking a test on a computer?; and: If you compare yourself with other 15-year-olds, how would you rate your ability to use a computer? For the first three questions, a four-point scale was used with the response categories very comfortable, comfortable, somewhat comfortable and not at all comfortable. For the last question, a four-point scale was used with the response categories excellent, good, fair and poor. For information on the conceptual underpinning of the index, see Eignor, Taylor, Kirsch and Jamieson (1998).
Scale scores are standardised Warm estimates where positive values indicate a higher self-perception of computer abilities and negative values indicate a lower self-perception of computer abilities. The item parameters used for the weighted likelihood estimation are given in Figure 71.
How comfortable:
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3)
IT02Q01 are you with using a computer? -0.64 -1.71 0.01 1.70
IT02Q02 are you with using a computer to write a paper? -0.26 -1.38 -0.13 1.51
IT02Q03 would you be taking a test on a computer? 0.50 -1.32 -0.11 1.43
IT03Q01 If you compare yourself with other 15-year-olds, how would you rate your ability to use a computer? 0.40 -2.52 -0.15 2.67
Note: All items were reversed for scaling.

Figure 71: Item Parameter Estimates for the Index of Comfort With and Perceived Ability to Use Computers (COMAB)

Computer usage

The PISA index of computer usage was derived from students’ responses to the six questions in Figure 72, which gives the item parameters used for the weighted likelihood estimation. A five-point scale with the response categories almost every day, a few times each week, between once a week and once a month, less than once a month and never was used. For information on the conceptual underpinning of the index, see Eignor, Taylor, Kirsch and Jamieson (1998).
Scale scores are standardised Warm estimates where positive values indicate higher frequency and negative values indicate lower frequency of computer use.
Interest in computers

The PISA index of interest in computers was derived from the students’ responses to the four items in Figure 73, which gives the item parameters used for the weighted likelihood estimation. A two-point scale with the response categories yes and no was used.
Scale scores are standardised Warm estimates where positive values indicate a more positive attitude towards computers and negative values a less positive attitude towards computers.
Item dimensionality and reliability of IT-related scales
The three-dimensional model was confirmed by a CFA for the international sample (RMSEA = 0.070, AGFI = 0.92, NNFI = 0.91 and CFI = 0.91). The correlation between latent factors was 0.60 for COMUSE and COMAB, 0.59 for COMATT and COMAB, and 0.50 for COMATT and COMUSE. The RMSEA for the country sub-samples ranged between 0.049 and 0.095.
Table 84 shows the reliability of each of the three IT variables for each participating country. Whereas the internal consistency for COMUSE and COMAB is very high, COMATT has a somewhat lower but still satisfactory reliability in view of the fact that it consists of only four dichotomous items.
How often do you use:
Parameter Estimates: Delta Tau(1) Tau(2) Tau(3) Tau(4)
IT05Q03 the computer to help you learn school material? -0.17 -0.92 -0.75 0.23 1.44
IT05Q04 the computer for programming? 0.35 -0.27 -0.47 -0.06 0.80
How often do you use each of the following kinds of computer software?
IT06Q02 Word processing (e.g., Word® or Word Perfect®) -0.71 -0.70 -1.18 0.24 1.64
IT06Q03 Spreadsheets (e.g., Lotus® or Excel®) 0.23 -1.01 -0.60 0.17 1.44
IT06Q04 Drawing, painting or graphics -0.10 -1.12 -0.38 0.22 1.28
IT06Q05 Educational software 0.40 -1.06 -0.48 0.21 1.33
Note: All items were reversed for scaling.
Figure 72: Item Parameter Estimates for the Index of Computer Usage (COMUSE)
Parameter Estimates: Delta
IT07Q01 It is very important to me to work with a computer. 0.53
IT08Q01 To play or work with a computer is really fun. -1.28
IT09Q01 I use a computer because I am very interested in this. 0.11
IT10Q01 I forget the time, when I am working with the computer. 0.64
Note: All items were reversed for scaling.
Figure 73: Item Parameter Estimates for the Index of Interest in Computers (COMATT)
School type

A school was classified as either public or private according to whether a public agency or a private entity had the ultimate decision-making power concerning its affairs. A school was classified as public if the principal reported that it was managed directly or indirectly by a public education authority, government agency, or by a governing board appointed by government or elected by public franchise. A school was classified as private if the principal reported that it was managed directly or indirectly by a non-government organisation (e.g., a church, a trade union, a business or another private institution).
A distinction was made between government-dependent and independent private schools according to the degree of dependence on government funding. School principals were asked to specify the percentage of the school’s total funding received in a typical school year from: government sources; student fees or school charges paid by parents; benefactors, donations, bequests, sponsorships or parental fund-raising; and other sources. Schools were classified as government-dependent private if they received 50 per cent or more of their core funding from government agencies, and independent private if they received less than 50 per cent of their core funding from government agencies.
• SCHLTYPE is based on questions SC03Q01 and SC04Q01 to SC04Q04. This variable has the following categories: 1 = Private, government-independent; 2 = Private, government-dependent; 3 = Government.
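The classification rules above can be sketched as follows; this is an illustrative helper of our own, not the operational PISA code:

```python
def schltype(public_management: bool, pct_government_funding: float) -> int:
    """Classify a school into the SCHLTYPE categories:
    3 = Government (managed by a public authority or board),
    2 = Private, government-dependent (>= 50% core funding from government),
    1 = Private, government-independent (< 50%)."""
    if public_management:
        return 3
    return 2 if pct_government_funding >= 50 else 1
```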
School size

SCHLSIZE is the sum of questions SC02Q01 and SC02Q02, which ask for the total number of boys and the total number of girls enrolled as of March 31, 2000.
Percentage of girls

PCGIRLS is the ratio of the number of girls to the total enrolment (SCHLSIZE).
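The two enrolment-based indices reduce to simple arithmetic; an illustrative sketch (function name ours):

```python
def school_indices(boys: int, girls: int):
    """SCHLSIZE is total enrolment (SC02Q01 + SC02Q02);
    PCGIRLS is the proportion of girls in that total."""
    schlsize = boys + girls
    pcgirls = girls / schlsize if schlsize else None
    return schlsize, pcgirls
```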
Table 84: Reliability of Computer Familiarity or Information Technology Scales
Country COMAB COMUSE COMATT
Australia 0.84 0.81 0.69
Belgium (Fl.) 0.85 0.80 0.65
Canada 0.83 0.84 0.68
Czech Republic 0.85 0.79 0.60
Denmark 0.81 0.82 0.73
Finland 0.85 0.83 0.69
Germany 0.86 0.83 0.68
Hungary 0.72 0.76 0.69
Ireland 0.86 0.82 0.59
Luxembourg 0.87 0.85 0.71
Mexico 0.80 0.83 0.57
New Zealand 0.85 0.83 0.67
Norway - 0.80 -
Sweden 0.79 0.83 0.68
Switzerland 0.86 0.82 0.73
United States 0.79 0.80 0.69
Mean OECD countries 0.83 0.82 0.68
Brazil 0.87 0.81 0.51
Latvia 0.80 0.80 0.65
Liechtenstein 0.83 0.83 0.72
Russian Federation 0.83 0.83 0.76
Scotland 0.86 0.79 0.57
Note: In Norway, only one item was included for the scale COMAB; no items were included for COMATT. The Flemish Community of Belgium constitutes half of Belgium’s population and has been given the status of a country for these analyses.
Basic School Characteristics
The indices in this section were derived from theSchool Questionnaire.
Hours of schooling

The PISA index of hours of schooling per year was based on information provided by principals on: the number of instructional weeks in the school year; the number of <class periods> in the school week; and the number of teaching minutes in the average single <class period>. The index was derived from the product of these three factors, divided by 60, and has the variable label TOTHRS.
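The TOTHRS derivation is a direct product of the three reported quantities; an illustrative sketch with hypothetical values:

```python
def tothrs(weeks_per_year: int, periods_per_week: int,
           minutes_per_period: int) -> float:
    """PISA index of hours of schooling per year: the product of the
    three quantities reported by the principal, converted from
    minutes to hours by dividing by 60."""
    return weeks_per_year * periods_per_week * minutes_per_period / 60

# e.g. 40 weeks x 30 periods x 50 minutes -> 1000.0 hours per year
```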
School Policies and Practices
School and teacher autonomy
School principals were asked to report whether teachers, department heads, the school principal, an appointed or elected board, or higher-level education authorities were primarily responsible for: appointing and dismissing teachers; establishing teachers’ starting salaries and determining their increases; formulating and allocating school budgets; establishing student disciplinary and student assessment policies; approving students for admission; choosing textbooks; determining course content; and deciding which courses were offered.
The PISA index of school autonomy was derived from the number of categories that principals classified as not being a school responsibility (which were inverted before scaling).
Scale scores are standardised Warm estimates, where positive values indicate higher levels and negative values indicate lower levels of school autonomy. The item parameters used for the weighted likelihood estimation are given in Figure 74.
The PISA index of teacher autonomy was derived from the number of categories that principals identified as being mainly the responsibility of teachers.
Scale scores are standardised Warm estimates, where positive values indicate higher levels and negative values indicate lower levels of teacher participation in school decisions. The item parameters used for the weighted likelihood estimation are given in Figure 75.
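The raw inputs to both indices are counts over the twelve SC22 items. The counting step can be sketched as below; the category labels and function name are hypothetical, and the subsequent Rasch scaling and Warm estimation are not shown:

```python
def autonomy_raw_counts(responses: dict) -> tuple:
    """Count, over the SC22 items, how many matters a principal marked as
    not being a school responsibility (input to the school autonomy
    index, inverted before scaling) and how many as mainly the
    responsibility of teachers (input to the teacher autonomy index).

    'responses' maps item names (e.g., 'SC22Q01') to hypothetical
    category labels such as 'not_school', 'teachers', 'board'.
    """
    not_school = sum(1 for v in responses.values() if v == "not_school")
    teachers = sum(1 for v in responses.values() if v == "teachers")
    return not_school, teachers
```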
In your school, who has the main responsibility for:            Parameter Estimates
(not a school responsibility)                                   Delta
SC22Q01 hiring teachers?                                        -0.87
SC22Q02 firing teachers?                                        -1.49
SC22Q03 establishing teachers’ starting salaries?               -3.65
SC22Q04 determining teachers’ salary increases?                 -3.49
SC22Q05 formulating the school budget?                          -0.30
SC22Q06 deciding on budget allocations within the school?        2.00
SC22Q07 establishing student disciplinary policies?              3.67
SC22Q08 establishing student assessment policies?                1.91
SC22Q09 approving students for admittance to school?             0.53
SC22Q10 choosing which textbooks are used?                       2.09
SC22Q11 determining course content?                             -0.32
SC22Q12 deciding which courses are offered?                     -0.09
Note: All items were reversed for scaling.
Figure 74: Item Parameter Estimates for the Index of School Autonomy (SCHAUTON)
In your school, who is primarily responsible for:               Parameter Estimates
(teachers)                                                      Delta
SC22Q01 hiring teachers?                                         1.68
SC22Q02 firing teachers?                                         3.98
SC22Q03 establishing teachers’ starting salaries?                4.27
SC22Q04 determining teachers’ salary increases?                  3.48
SC22Q05 formulating the school budget?                           1.16
SC22Q06 deciding on budget allocations within the school?       -0.16
SC22Q07 establishing student disciplinary policies?             -2.85
SC22Q08 establishing student assessment policies?               -3.31
SC22Q09 approving students for admittance to school?             0.64
SC22Q10 choosing which textbooks are used?                      -4.06
SC22Q11 determining course content?                             -3.12
SC22Q12 deciding which courses are offered?                     -1.73
Note: All items were reversed for scaling.
Figure 75: Item Parameter Estimates for the Index of Teacher Autonomy (TCHPARTI)
Item dimensionality and reliability of ‘school and teacher autonomy’ scales
A comparative CFA based on covariance matrices could not be completed for this index because in several countries there was no variance for some of the items used. This is plausible given that, depending on the educational system, schools and/or teachers never decide certain matters.
Table 85 shows the reliability of each school and teacher autonomy scale for each participating country. Not surprisingly, this varies considerably across countries.
Table 85: Reliability of School Policies and Practices Scales
Country SCHAUTON TCHPARTI
Australia             0.75  0.80
Austria               0.58  0.61
Belgium               0.61  0.59
Canada                0.75  0.71
Czech Republic        0.78  0.66
Denmark               0.62  0.78
Finland               0.67  0.65
France                0.78  0.63
Germany               0.81  0.75
Greece                0.87  0.49
Hungary               0.60  0.78
Iceland               0.64  0.62
Ireland               0.60  0.57
Italy                 0.57  0.67
Japan                 0.82  0.82
Korea                 0.61  0.62
Mexico                0.83  0.61
New Zealand           0.11  0.74
Portugal              0.48  0.50
Spain                 0.77  0.72
Sweden                0.69  0.76
Switzerland           0.74  0.68
United Kingdom        0.71  0.74
United States         0.78  0.82
Mean OECD countries   0.77  0.73
Brazil                0.82  0.66
Latvia                0.64  0.63
Liechtenstein         0.88  0.82
Russian Federation    0.61  0.38
Netherlands           0.41  0.77
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards. No results are available for Norway or Poland, and schools in Luxembourg all provided the same answers.
School Climate
School principals’ perceptions of teacher-related factors affecting school climate
The PISA index of the principals’ perceptions of teacher-related factors affecting school climate was derived from principals’ reports on the extent to which 15-year-olds were hindered in their learning by: the low expectations of teachers; poor student-teacher relations; teachers not meeting individual students’ needs; teacher absenteeism; staff resisting change; teachers being too strict with students; and students not being encouraged to achieve their full potential. The questions asked are shown in Figure 76. A four-point scale with the response categories not at all, very little, to some extent and a lot was used, and the responses were inverted before scaling.
Scale scores are standardised Warm estimates, where positive values indicate that teacher-related factors do not hinder the learning of 15-year-olds, whereas negative values indicate schools in which there is a perception that teacher-related factors hinder the learning of 15-year-olds. The item parameters used for the weighted likelihood estimation are given in Figure 76.
School principals’ perceptions of student-related factors affecting school climate
The PISA index of the principals’ perceptions of student-related factors affecting school climate was derived from principals’ reports on the extent to which 15-year-olds were hindered in their learning by: student absenteeism; disruption of classes by students; students skipping classes; students lacking respect for teachers; the use of alcohol or illegal drugs; and students intimidating or bullying other students. A four-point scale with the response categories not at all, very little, to some extent and a lot was used, and the responses were inverted before scaling.
Scale scores are standardised Warm estimates, where positive values indicate that student-related factors do not hinder the learning of 15-year-olds, whereas negative values indicate schools in which there is a perception that student-related factors hinder the learning of 15-year-olds. The item parameters used for the weighted likelihood estimation are given in Figure 77.
School principals’ perception of teachers’ morale and commitment
The PISA index of principals’ perception of teachers’ morale and commitment was derived from the extent to which school principals agreed with the statements in Figure 78, which gives the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories strongly disagree, disagree, agree and strongly agree was used.
Scale scores are standardised Warm estimates, where positive values indicate a higher perception of teacher morale, and negative values a lower perception of teacher morale.
In your school, is the learning of <15-year-old students>       Parameter Estimates
hindered by:                                                    Delta  Tau(1)  Tau(2)  Tau(3)
SC19Q01 low expectations of teachers? 0.28 -2.15 -0.36 2.50
SC19Q03 poor student-teacher relations? 0.00 -2.45 0.77 1.68
SC19Q07 teachers not meeting individual students’ needs? -0.57 -2.81 -0.07 2.88
SC19Q08 teacher absenteeism? -0.05 -2.50 0.31 2.18
SC19Q11 staff resisting change? -0.21 -2.40 -0.23 2.63
SC19Q14 teachers being too strict with students? 0.85 -2.63 0.66 1.97
SC19Q16 students not being encouraged to achieve their full potential? -0.30 -2.07 -0.05 2.13
Note: All items were reversed for scaling.
Figure 76: Item Parameter Estimates for the Index of Principals’ Perceptions of Teacher-Related Factors Affecting School Climate (TEACBEHA)
In your school, is the learning of <15-year-old students>       Parameter Estimates
hindered by:                                                    Delta  Tau(1)  Tau(2)  Tau(3)
SC19Q02 student absenteeism? -1.02 -2.25 -0.07 2.31
SC19Q06 disruption of classes by students? -0.62 -2.86 0.04 2.83
SC19Q09 students skipping classes? -0.29 -2.24 0.1 2.14
SC19Q10 students lacking respect for teachers? 0.09 -2.8 0.19 2.61
SC19Q13 the use of alcohol or illegal drugs? 1.02 -1.68 0.82 0.86
SC19Q15 students intimidating or bullying other students? 0.82 -2.76 0.39 2.37
Note: All items were reversed for scaling.
Figure 77: Item Parameter Estimates for the Index of Principals’ Perceptions of Student-Related Factors Affecting School Climate (STUDBEHA)
Think about the teachers in your school. How much do you        Parameter Estimates
agree or disagree with the following statements?                Delta  Tau(1)  Tau(2)  Tau(3)
SC20Q01 The morale of teachers in this school is high. 0.55 -2.45 -0.62 3.07
SC20Q02 Teachers work with enthusiasm. 0.07 -2.64 -1.05 3.70
SC20Q03 Teachers take pride in this school. 0.05 -2.32 -1.09 3.41
SC20Q04 Teachers value academic achievement. -0.67 -1.16 -1.70 2.86
Figure 78: Item Parameter Estimates for the Index of Principals’ Perceptions of Teachers’ Morale and Commitment (TCMORALE)
Item dimensionality and reliability of ‘school climate’ scales
The model fit in a CFA was satisfactory for the international sample (RMSEA = 0.071, AGFI = 0.91, NNFI = 0.89 and CFI = 0.91). The correlation between latent factors was 0.77 for TEACBEHA and STUDBEHA, 0.37 for TCMORALE and TEACBEHA, and 0.23 for TCMORALE and STUDBEHA. For the country samples, the RMSEA ranged between 0.043 and 0.107.
Table 86 shows that the reliability of each of the school climate variables is high across countries.
Table 86: Reliability of School Climate Scales
Country TEACBEHA STUDBEHA TCMORALE
Australia             0.86  0.86  0.74
Austria               0.73  0.75  0.69
Belgium               0.81  0.88  0.73
Canada                0.83  0.81  0.78
Czech Republic        0.78  0.76  0.70
Denmark               0.70  0.82  0.73
Finland               0.76  0.66  0.66
France                0.73  0.80  0.79
Germany               0.77  0.78  0.79
Greece                0.90  0.83  0.84
Hungary               0.80  0.83  0.71
Iceland               0.78  0.77  0.87
Ireland               0.81  0.81  0.81
Italy                 0.87  0.81  0.80
Japan                 0.82  0.85  0.89
Korea                 0.82  0.78  0.87
Luxembourg            0.78  0.77  0.67
Mexico                0.89  0.88  0.87
New Zealand           0.82  0.78  0.84
Norway                0.78  0.76  0.76
Poland                0.73  0.82  0.59
Portugal              0.75  0.81  0.80
Spain                 0.83  0.83  0.64
Sweden                0.76  0.74  0.75
Switzerland           0.77  0.83  0.74
United Kingdom        0.90  0.86  0.81
United States         0.86  0.74  0.88
Mean OECD countries   0.83  0.81  0.79
Brazil                0.85  0.84  0.83
Latvia                0.79  0.73  0.71
Liechtenstein         0.78  0.89  0.52
Russian Federation    0.79  0.80  0.74
Netherlands           0.79  0.85  0.60
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
School Resources
Quality of the schools’ educational resources
The PISA index of the quality of a school’s educational resources was based on the school principals’ reports on the extent to which 15-year-olds were hindered in their learning by a lack of resources. The questions asked are shown in Figure 79, which gives the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories not at all, very little, to some extent and a lot was used, and the responses were inverted before scaling.
Scale scores are standardised Warm estimates, where positive values indicate that the learning of 15-year-olds was not hindered by the (un)availability of educational resources. Negative values indicate the perception of a lower quality of educational materials at school.
Quality of the schools’ physical infrastructure
The PISA index of the quality of a school’s physical infrastructure was derived from principals’ reports on the extent to which learning by 15-year-olds in their school was hindered by: poor condition of buildings; poor heating, cooling and/or lighting systems; and lack of instructional space (e.g., classrooms). The questions are shown in Figure 80, which gives the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories not at all, very little, to some extent and a lot was used, and responses were reversed before scaling.
Scale scores are standardised Warm estimates, where positive values indicate that the learning of 15-year-olds was not hindered by the school’s physical infrastructure, and negative values indicate the perception that the learning of 15-year-olds was hindered by the school’s physical infrastructure.
Teacher shortage
The PISA index of teacher shortage was derived from the principals’ view on how much 15-year-old students were hindered in their learning by the shortage or inadequacy of teachers in the <test language>, mathematics or science. The items are given in Figure 81, which shows the item parameters used for the weighted likelihood estimation. A four-point scale with the response categories not at all, a little, somewhat and a lot was used, and items were reversed before scaling.
In your school, how much is the learning of                     Parameter Estimates
<15-year-old students> hindered by:                             Delta  Tau(1)  Tau(2)  Tau(3)
SC11Q04 lack of instructional material (e.g., textbooks)? 0.76 -1.41 0.06 1.35
SC11Q05 not enough computers for instruction? -0.30 -1.50 -0.18 1.67
SC11Q06 lack of instructional materials in the library? 0.10 -1.66 -0.11 1.76
SC11Q07 lack of multi-media resources for instruction? -0.39 -1.79 -0.19 1.99
SC11Q08 inadequate science laboratory equipment? -0.07 -1.51 -0.08 1.59
SC11Q09 inadequate facilities for the fine arts? -0.09 -1.45 -0.07 1.51
Note: All items were reversed for scaling.
Figure 79: Item Parameter Estimates for the Index of the Quality of the Schools’ Educational Resources (SCMATEDU)
In your school, how much is the learning of                     Parameter Estimates
<15-year-old students> hindered by:                             Delta  Tau(1)  Tau(2)  Tau(3)
SC11Q01 poor condition of buildings?                             0.09 -1.27 -0.47 1.75
SC11Q02 poor heating, cooling and/or lighting systems?           0.21 -1.36 -0.34 1.70
SC11Q03 lack of instructional space (e.g., classrooms)?         -0.30 -1.18 -0.34 1.53
Note: All items were reversed for scaling.
Figure 80: Item Parameter Estimates for the Index of the Quality of the Schools’ Physical Infrastructure (SCMATBUI)
Scale scores are standardised Warm estimates, where positive values indicate that the learning of 15-year-olds was not hindered by teacher shortage or inadequacy. Negative values indicate the perception that 15-year-olds were hindered in their learning by teacher shortage or inadequacy.
Item dimensionality and reliability of ‘school resources’ scales
A CFA showed an acceptable model fit for the international sample (RMSEA = 0.077, AGFI = 0.91, NNFI = 0.93 and CFI = 0.94). The correlation between latent factors was 0.67 for SCMATEDU and SCMATBUI, 0.32 for TCSHORT and SCMATBUI, and 0.36 for TCSHORT and SCMATEDU. The RMSEA for the country samples ranged between 0.054 and 0.139.
Table 87 shows the reliability of each of the school resources scales (apart from computers, which are covered in the next section) for each participating country. Internal consistency is satisfactory across countries for all three scales (with the sole exception of SCMATBUI for Liechtenstein, where the number of schools was very small).
In your school, is the learning of <15-year-old students>       Parameter Estimates
hindered by:                                                    Delta  Tau(1)  Tau(2)  Tau(3)
SC21Q01 a shortage/inadequacy of teachers? -0.51 -2.25 0.16 2.08
SC21Q02 a shortage/inadequacy of <test language> teachers? 0.37 -1.64 0.18 1.46
SC21Q03 a shortage/inadequacy of <mathematics> teachers? 0.15 -1.68 0.09 1.60
SC21Q04 a shortage/inadequacy of <science> teachers? -0.01 -1.61 0.02 1.59
Note: All items were reversed for scaling.
Figure 81: Item Parameter Estimates for the Index of Teacher Shortage (TCSHORT)
Table 87: Reliability of School Resources Scales
Country SCMATEDU SCMATBUI TCSHORT
Australia             0.89  0.75  0.82
Austria               0.77  0.77  0.67
Belgium               0.86  0.73  0.83
Canada                0.87  0.80  0.87
Czech Republic        0.83  0.65  0.65
Denmark               0.79  0.70  0.76
Finland               0.84  0.82  0.72
France                0.83  0.72  0.88
Germany               0.86  0.78  0.84
Greece                0.84  0.91  0.95
Hungary               0.79  0.59  0.87
Iceland               0.80  0.71  0.83
Ireland               0.84  0.75  0.78
Italy                 0.81  0.72  0.89
Japan                 0.84  0.70  0.93
Availability of computers
School principals provided information on the total number of computers available in their schools and, more specifically, on the number of computers: available to 15-year-olds; available only to teachers; available only to administrative staff; connected to the Internet; and connected to a local area network. Six indices of computer availability were prepared and included in the database:
• RATCOMP, which is the total number of computers in the school divided by the school enrolment size;
• PERCOMP1, which is the number of computers available for 15-year-olds (SC13Q02) divided by the total number of computers (SC13Q01);
• PERCOMP2, which is the number of computers available for teachers only (SC13Q03) divided by the total number of computers (SC13Q01);
• PERCOMP3, which is the number of computers available only for administrative staff (SC13Q04) divided by the total number of computers (SC13Q01);
• PERCOMP4, which is the number of computers connected to the Web (SC13Q05) divided by the total number of computers (SC13Q01); and
• PERCOMP5, which is the number of computers connected to a local area network (SC13Q06) divided by the total number of computers (SC13Q01).
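These six indices are simple ratios of the SC13 counts; a sketch (the dictionary layout, the 'enrolment' key and the division-by-zero guard are assumptions, not part of the database definitions):

```python
def computer_availability_indices(sc13: dict) -> dict:
    """Compute RATCOMP and PERCOMP1-PERCOMP5 as defined above.

    'sc13' maps SC13 question numbers to computer counts and includes an
    'enrolment' entry holding SCHLSIZE. Returns None for a ratio whose
    denominator is zero (an assumption for the sketch).
    """
    total = sc13["SC13Q01"]  # total number of computers in the school

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "RATCOMP": ratio(total, sc13["enrolment"]),   # computers per student
        "PERCOMP1": ratio(sc13["SC13Q02"], total),    # available to 15-year-olds
        "PERCOMP2": ratio(sc13["SC13Q03"], total),    # teachers only
        "PERCOMP3": ratio(sc13["SC13Q04"], total),    # administrative staff only
        "PERCOMP4": ratio(sc13["SC13Q05"], total),    # connected to the Internet
        "PERCOMP5": ratio(sc13["SC13Q06"], total),    # on a local area network
    }
```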
Teacher qualifications
School principals indicated the number of full- and part-time teachers employed in their schools and the numbers of: <test language> teachers, mathematics teachers and science teachers; teachers fully certified by the <appropriate national authority>; and teachers with a qualification of <ISCED level 5A> in <pedagogy>, or <ISCED level 5A> in the <test language>, or <ISCED level 5A> in <mathematics>, or <ISCED level 5A> in <science>. Five variables were constructed:
• PROPQUAL, which is the number of teachers with an ISCED 5A qualification in pedagogy (SC14Q02) divided by the total number of teachers (SC14Q01);
Table 87 (cont.)
Country SCMATEDU SCMATBUI TCSHORT
Korea                 0.86  0.75  0.93
Luxembourg            0.75  0.63  0.80
Mexico                0.88  0.81  0.91
New Zealand           0.87  0.82  0.90
Norway                0.86  0.77  0.88
Poland                0.84  0.79  0.88
Portugal              0.85  0.68  0.87
Spain                 0.85  0.73  0.85
Sweden                0.83  0.79  0.87
Switzerland           0.79  0.68  0.74
United Kingdom        0.91  0.90  0.89
United States         0.77  0.73  0.85
Mean OECD countries   0.85  0.79  0.88
Brazil                0.87  0.83  0.88
Latvia                0.79  0.70  0.62
Liechtenstein         0.45  0.17  0.59
Russian Federation    0.78  0.77  0.85
Netherlands           0.89  0.80  0.82
Note: ‘Mean OECD countries’ does not include the Netherlands, which did not meet PISA’s sampling standards.
• PROPCERT, which is the number of teachers fully certified by the appropriate authority (SC14Q03) divided by the total number of teachers (SC14Q01);
• PROPREAD, which is the number of test language teachers with an ISCED 5A qualification (SC14Q05) divided by the number of test language teachers (SC14Q04);
• PROPMATH, which is the number of mathematics teachers with an ISCED 5A qualification (SC14Q07) divided by the number of mathematics teachers (SC14Q06); and
• PROPSCIE, which is the number of science teachers with an ISCED 5A qualification (SC14Q09) divided by the total number of science teachers (SC14Q08).
Student/teaching staff ratio
The student/teaching staff ratio (STRATIO) was defined as the number of students in the school divided by the number of full-time-equivalent teachers. To convert head counts into full-time equivalents, a full-time teacher employed for at least 90 per cent of the statutory time as a classroom teacher received a weight of 1, and a part-time teacher employed for less than 90 per cent of the time as a classroom teacher received a weight of 0.5.
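The full-time-equivalent conversion and ratio can be sketched as follows (the function name and the guard against a zero teacher count are assumptions):

```python
def student_teacher_ratio(n_students: int, n_full_time: int, n_part_time: int):
    """STRATIO sketch: full-time teachers (at least 90 per cent of the
    statutory time) are weighted 1, part-time teachers 0.5, and the
    student count is divided by the resulting full-time equivalent.
    Returns None if there are no teaching staff (an assumption)."""
    fte = n_full_time * 1.0 + n_part_time * 0.5
    return n_students / fte if fte else None
```

For example, a school of 400 students with 18 full-time and 8 part-time teachers has 18 + 8 x 0.5 = 22 full-time equivalents, giving a ratio of about 18.2 students per teacher.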
Class size
An estimate of class size was obtained from students’ reports on the number of students in their respective <test language>, mathematics and science classes.
Chapter 18

International Database
Christian Monseur
Files in the Database
The PISA international database consists of five data files: four student-level files and one school-level file. All are provided in text (ASCII) format with the corresponding SAS® and SPSS® control files.
The student files
Student and reading performance data files (filename: intstud_read.txt)
For each student who participated in the assessment, the following information is available:
• Identification variables for the country, school and student;
• The student responses on the three questionnaires, i.e., the student questionnaire and the two international options: the Computer Familiarity or Information Technology (IT) questionnaire and the Cross-Curriculum Competencies (CCC) questionnaire;
• The students’ indices derived from the original questions in the questionnaires;
• The students’ performance scores in reading;
• The student weights and a country adjustment factor for the reading weights; and
• The 80 reading Fay’s replicates for the computation of the sampling variance estimates.
Student and mathematics performance data files (filename: intstud_math.txt)
For each student who was assessed with one of the booklets containing mathematics material, the following information is available:
• Identification variables for the country, school and student;
• The students’ responses on the three questionnaires, i.e., the student questionnaire and the two international options: the Computer Familiarity or Information Technology (IT) questionnaire and the Cross-Curriculum Competencies (CCC) questionnaire;
• The students’ indices derived from the original questions in the questionnaires;
• The students’ performance scores in reading and mathematics;
• The student weights and a country adjustment factor for the mathematics weights; and
• The 80 reading Fay’s replicates for the computation of the sampling variance estimates.
Student and science performance data files (filename: intstud_scie.txt)

For each student who was assessed with one of the booklets containing science material, the following information is available:
• Identification variables for the country, school and student;
• The student responses on the three questionnaires, i.e., the student questionnaire and the two international options: the Computer Familiarity or Information Technology (IT) questionnaire and the Cross-Curriculum Competencies (CCC) questionnaire;
• The students’ indices derived from the original questions in the questionnaires;
• The students’ performance scores in reading and science;
• The student weights and a country adjustment factor for the science weights; and
• The 80 reading Fay’s replicates for the computation of the sampling variance estimates.
The school file
The school questionnaire data file (filename: intscho.txt)
For each school that participated in the assessment, the following information is available:
• The identification variables for the country and school;
• The school responses on the school questionnaire;
• The school indices derived from the original questions in the school questionnaire; and
• The school weight.
The assessment items data file
The cognitive file (filename: intcogn.txt)
For each item included in the test, this file shows the students’ responses expressed in a one-digit format. The items from mathematics and science used double-digit coding during marking.1 A file including these codes was available to national centres.
Records in the database
Records included in the database
Student level
• All PISA students who attended one of the two test (assessment) sessions.
• PISA students who only attended the questionnaire session are included if they provided a response to the father’s occupation questions or the mother’s occupation questions on the student questionnaire (questions 8 to 11).
School level
• All participating schools (that is, any school where at least 25 per cent of the sampled eligible students were assessed) have a record in the school-level international database, regardless of whether the school returned the school questionnaire.
Records excluded from the database
Student level
• Additional data collected by some countries for a national or international option, such as a grade sample.
• Sampled students who were reported as not eligible, students who were no longer at school, students who were excluded for physical, mental or linguistic reasons, and students who were absent on the testing day.
• Students who refused to participate in the assessment sessions.
• Students from schools where less than 25 per cent of the sampled and eligible students participated.
School level
• Schools where fewer than 25 per cent of the sampled eligible students participated in the testing sessions.
Representing missing data
The coding of the data distinguishes between four different types of missing data:
• Item-level non-response: 9 for a one-digit variable, 99 for a two-digit variable, 999 for a three-digit variable, and so on. Missing codes are shown in the codebooks. This missing code is used if the student or school principal was expected to answer a question, but no response was actually provided.
• Multiple or invalid responses: 8 for a one-digit variable, 98 for a two-digit variable, 998 for a three-digit variable, and so on. This code is used for multiple-choice items in both test booklets and questionnaires where an invalid response was provided. This code is not used for open-ended questions.
1 The responses from open-ended items could give valuable information about students’ ideas and thinking, which could be fed back into curriculum planning. For this reason, the marking guides for these items in mathematics and science were designed to include a two-digit marking so that the frequency of various types of correct and incorrect response could be recorded. The first digit was the actual score. The second digit was used to categorise the different kinds of response on the basis of the strategies used by the student to answer the item. The international database includes only the first digit.
• Not applicable: 7 for a one-digit variable, 97 for a two-digit variable, 997 for a three-digit variable, and so on, for the student questionnaire data file and for the school data file. Code “n” is used for a one-digit variable in the test booklet data file. This code is used when it was not possible for the student to answer the question; for instance, if a question was misprinted or if a question was deleted from the questionnaire by a national centre. The not-applicable codes and code “n” are also used in the test booklet file for questions that were not included in the test booklet that the student received.
• Not-reached items: all consecutive missing values starting from the end of each test session were replaced by the not-reached code, “r”, except for the first value of the missing series, which is coded as missing.
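For the numeric codes, the pattern is regular: the code is the variable's width filled with 9s, with the last digit replaced by 9, 8 or 7 depending on the missing-data type. A sketch (the function name and category labels are illustrative; the "n" and "r" test-booklet codes are not covered):

```python
def missing_code(width: int, kind: str) -> int:
    """Return the numeric missing-data code for a variable of the given
    digit width: item non-response ends in 9, multiple/invalid response
    in 8, not-applicable in 7 (e.g., 999, 98, 997)."""
    last_digit = {"missing": 9, "invalid": 8, "not_applicable": 7}[kind]
    if width == 1:
        return last_digit
    return int("9" * (width - 1) + str(last_digit))
```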
How are students and schools identified?
The student identification in the student files consists of three variables, which together form a unique identifier for each student:
• The country identification variable, labelled COUNTRY. The country codes used in PISA are the ISO 3166 country codes.
• The school identification variable, labelled SCHOOLID.
• The student identification variable, labelled STIDSTD.
A fourth variable has been included to differentiate sub-national entities within countries. This variable (SUBNATIO) is used for four countries as follows:
• Australia. The eight values “01” to “08” are assigned to the Australian Capital Territory, New South Wales, Victoria, Queensland, South Australia, Western Australia, Tasmania and the Northern Territory respectively.
• Belgium. The value “01” is assigned to the French Community and the value “02” is assigned to the Flemish Community.
• Switzerland. The value “01” is assigned to the German-speaking community, the value “02” is assigned to the French-speaking community and the value “03” is assigned to the Italian-speaking community.
• United Kingdom. The value “01” is assigned to Scotland, the value “02” is assigned to England and the value “03” is assigned to Northern Ireland.
The school identification consists of two variables, which together form a unique identifier for each school:
• The country identification variable, labelled COUNTRY. The country codes used in PISA are the ISO 3166 country codes.
• The school identification variable, labelled SCHOOLID.
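One practical consequence of this scheme is that merging the student and school files requires compound keys; a sketch of how an analyst might build them (the functions and the tuple-key approach are illustrative, not part of the database):

```python
def student_key(record: dict) -> tuple:
    """A student is uniquely identified by (COUNTRY, SCHOOLID, STIDSTD)."""
    return (record["COUNTRY"], record["SCHOOLID"], record["STIDSTD"])

def school_key(record: dict) -> tuple:
    """A school is uniquely identified by (COUNTRY, SCHOOLID); this is
    also the key for joining student records to the school file."""
    return (record["COUNTRY"], record["SCHOOLID"])
```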
The student files
Two types of indices are provided in the student questionnaire files. The first set is based on a transformation of one variable or on a combination of the information included in two or more variables; eight indices of this type are included in the database. The second set is the result of a Rasch scaling and consists of weighted likelihood estimate indices; of this type, the database includes 15 indices from the student questionnaire, 14 indices from the international option cross-curriculum competencies questionnaire and 3 indices from the international option computer familiarity or information technology questionnaire. For a full description of the indices and how to interpret them, see the Manual for the PISA 2000 Database (OECD, 2002a), also available through www.pisa.oecd.org.
Two types of estimates of student achievement are included in the student files for each domain (reading, mathematics and science) and for each subscale (retrieving information, interpreting texts, and reflection and evaluation): a weighted likelihood estimate and a set of five plausible values. It is recommended that the set of plausible values be used when analysing and reporting statistics at the population level; the use of weighted likelihood estimates for population estimates will yield biased estimates. The weighted likelihood estimates for the domain scales were each transformed to a mean of 500 and a standard deviation of 100 by using the data of the participating OECD Member countries only.2 These
2 With the exception of the Netherlands, which did not reach the PISA 2000 sampling standards.
linear transformations used weighted data, but the weights provided in the international database were transformed so that the sum of the weights per country was a constant. The same transformation was applied to the reading subscales. The linear transformation applied to the plausible values is based on the same rules as the ones used for the weighted estimates, except that the standardisation parameters were derived from the average of the mean and standard deviation computed from the five plausible values. For this reason, the means and variances of the individual plausible values are not exactly 500 and 100 respectively, but the average of the five means and the five variances is.
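The linear transformation itself has the usual standardisation form; a sketch that ignores the survey weights (the actual transformation used weighted OECD data, and the reference mean and standard deviation here are plain parameters, not the estimated PISA values):

```python
def standardise_to_pisa_scale(scores, ref_mean: float, ref_sd: float):
    """Map ability estimates onto the PISA reporting scale so that the
    reference population has mean 500 and standard deviation 100:
    x -> 500 + 100 * (x - ref_mean) / ref_sd."""
    return [500.0 + 100.0 * (x - ref_mean) / ref_sd for x in scores]
```

A score one reference standard deviation above the reference mean therefore maps to 600 on the reporting scale.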
In the international data files, the variable called W_FSTUWT is the final student weight. The sum of the weights constitutes an estimate of the size of the target population, i.e., the number of 15-year-old students in that country attending school. In this situation, large countries would have a stronger contribution to the results than small countries. These weights are appropriate for the analysis of data that have been collected from all assessed students, such as student questionnaire data and reading performance data.
Because of the PISA test design, using thereading weights for analysing the mathematics orscience data will overweight the studentsassessed with booklet zero (or SE Booklet) andtherefore (typically) underestimate the results. Tocorrect this over-weighting of the students takingthe SE booklet, weight adjustment factors mustbe applied to the weights and replicates. Becauseof the need to use these adjustment factors inanalyses, and to avoid accidental misuse of thestudent data, these data are provided in the threeseparate files: • The file Intstud_read.txt comprises the reading
ability estimates and weights. This file containsall eligible students who participated in thesurvey. As the sample design assessed readingby all students, no adjustment was needed;
• The file Intstud_math.txt comprises the reading and mathematics ability estimates. Weights and replicates in this file have already been adjusted by the mathematics adjustment factor. Thus, no further transformations of the weights or replicates are required by analysts of the data; and
• The file Intstud_scie.txt comprises the reading and science ability estimates. Weights and replicates in this file have already been adjusted by the science adjustment factor. Thus, no further transformations of the weights or replicates are required by analysts of the data.
For a full description of the weighting methodology, including the test design, calculation of the reading weights, adjustment factors and how to use the weights, see the Manual for the PISA 2000 Database (OECD, 2002a), also available through www.pisa.oecd.org.
The cognitive files
The file with the test data (filename: intcogn.txt) contains individual students' responses to all items used for the international item calibration and in the generation of the plausible values. All item responses included in this file have a one-digit format, which contains the score for the student on that item.
The PISA items are organised into units. Each unit consists of a piece of text or related texts, followed by one or more questions. Each unit is identified by a short label and by a long label. The units' short labels consist of four characters. The first character is R, M or S, for reading, mathematics or science respectively. The next three characters indicate the unit name. For example, R083 is a reading unit called 'Household'. The full item label (usually seven characters) identifies each particular question within a unit. Thus items within a unit share the same initial four characters: all items in the unit 'Household' begin with 'R083', plus a question number; for example, the third question in the 'Household' unit is R083Q03.
In this file, the items are sorted by domain and alphabetically by short label within domain. This means that the mathematics items appear at the beginning of the file, followed by the reading items and then the science items. Within domains, units with smaller numeric identification appear before those with larger identification, and within each unit the first question precedes the second, and so on.
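The labelling scheme above is simple enough to parse mechanically. The following sketch (the function name and returned field names are hypothetical, chosen only for illustration) splits an item label into its domain, unit and question parts.

```python
def parse_item_label(label):
    """Split a PISA item label such as 'R083Q03' into its parts.

    The first character encodes the domain (R, M or S), the first four
    characters identify the unit, and the part from 'Q' onwards identifies
    the question within the unit (some labels carry a suffix, e.g. R040Q03A).
    """
    domains = {"R": "reading", "M": "mathematics", "S": "science"}
    unit, _, question = label.partition("Q")
    return {"domain": domains[unit[0]], "unit": unit, "question": "Q" + question}
```

For example, parse_item_label("R083Q03") reports the reading unit R083 and question Q03.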
The school file
The school files contain the original variables collected through the school context questionnaire.
Two types of indices are provided in the
school questionnaire files. The first set is based on a transformation of one variable or on a combination of two or more variables; the database includes 16 indices of this first type. The second set is the result of a Rasch scaling and consists of weighted likelihood estimate indices; eight indices of this second type are included in the database. For a full description of the indices and how to interpret them, see the Manual for the PISA 2000 Database (OECD, 2002a), also available through www.pisa.oecd.org.
The school base weight (WNRSCHBW), which has been adjusted for school non-response, is provided at the end of the school file. PISA uses an age sample instead of a grade sample. Additionally, the PISA sample of schools in some countries included primary schools, lower secondary schools, upper secondary schools, or even special education schools. For these two reasons, it is difficult to define the school population conceptually, except that it is the population of schools with at least one 15-year-old student. While in some countries the population of schools with 15-year-olds is similar to the population of secondary schools, in other countries these two populations of schools are very different.
The recommendation is therefore to analyse the school data at the student level. From a practical point of view, this means that the school data should be imported into the student data file. From a theoretical point of view, while it is possible to estimate the percentage of schools with a specific characteristic, it is not meaningful. Instead, the recommendation is to estimate the percentage of students with that school characteristic. For instance, the percentage of private schools versus public schools would not be estimated, but rather the percentage of students attending a private school versus the percentage of students attending a public school.
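The student-level analysis recommended above amounts to merging each student record with its school record and then computing weighted percentages over students. A minimal sketch, with hypothetical field names ('school_id', 'weight', 'private') chosen only for illustration:

```python
def percentage_of_students(students, schools, characteristic):
    """Weighted percentage of students attending a school with a given
    characteristic (e.g. private vs. public).

    students: records with 'school_id' and 'weight' (the final student weight)
    schools: mapping school_id -> school-questionnaire record
    """
    total = sum(s["weight"] for s in students)
    matching = sum(s["weight"] for s in students
                   if schools[s["school_id"]][characteristic])
    return 100.0 * matching / total
```

Note that the answer depends on enrolments and weights, not on the count of schools: one large private school can outweigh several small public ones.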
Further information
A full description of the PISA 2000 database, and guidelines on how to analyse it in accordance with the complex methodologies used to collect and process the data, is provided in the Manual for the PISA 2000 Database (OECD, 2002a), also available through www.pisa.oecd.org.
References
Adams, R.J., Wilson, M.R. and Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24.
Adams, R.J., Wilson, M.R. and Wu, M.L. (1997). Multilevel item response modeling: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 47-76.
Baumert, J., Heyn, S. and Köller, O. (1994). Das Kieler Lernstrategien-Inventar. Kiel: Institut für die Pädagogik der Naturwissenschaften an der Universität Kiel.
Baumert, J., Gruehn, S., Heyn, S., Köller, O. and Schnabel, K.U. (1997). Bildungsverläufe und Psychosoziale Entwicklung im Jugendalter (BIJU): Dokumentation - Band 1. Berlin: Max-Planck-Institut für Bildungsforschung.
Beaton, A.E. (1987). Implementing the New Design: The NAEP 1983-84 Technical Report (Report No. 15-TR-20). Princeton, NJ: Educational Testing Service.
Beaton, A.E., Mullis, I.V., Martin, M.O., Gonzalez, E.J., Kelly, D.L. and Smith, T.A. (1996). Mathematics Achievement in the Middle School Years. Chestnut Hill, MA: Center for the Study of Testing, Evaluation and Educational Policy, Boston College.
BMDP. (1992). BMDP Statistical Software. Los Angeles: Author.
Brennan, R.L. (1992). Elements of Generalizability Theory. Iowa City: American College Testing Program.
Browne, M.W. and Cudeck, R. (1993). Alternative ways of assessing model fit. In: K. Bollen and S. Long (eds.), Testing Structural Equation Models. Newbury Park: Sage.
Cochran, W.G. (1977). Sampling Techniques (3rd edition). New York: Wiley.
Cronbach, L.J., Gleser, G.C., Nanda, H. and Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: Wiley.
Dale, E. and Chall, J.S. (1948). A Formula for Predicting Readability. Columbus: Bureau of Educational Research, Ohio State University.
Dechant, E. (1991). Understanding and Teaching Reading: An Interactive Model. Hillsdale, NJ: Lawrence Erlbaum.
De Landsheere, G. (1973). Le test de closure, mesure de la lisibilité et de la compréhension. Bruxelles: Nathan-Labor.
Eignor, D., Taylor, C., Kirsch, I. and Jamieson, J. (1998). Development of a Scale for Assessing the Level of Computer Familiarity of TOEFL Students (TOEFL Research Report No. 60). Princeton, NJ: Educational Testing Service.
Elley, W.B. (1992). How in the World do Students Read? The Hague: International Association for the Evaluation of Educational Achievement.
Flesch, R. (1949). The Art of Readable Writing. New York: Harper & Row.
Fry, E.B. (1977). Fry's readability graph: Clarification, validity and extension to level 17. Journal of Reading, 20, 242-252.
Ganzeboom, H.B.G., de Graaf, P.M. and Treiman, D.J. (1992). A standard international socio-economic index of occupational status. Social Science Research, 21, 1-56.
Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.
Graesser, A.C., Millis, K.K. and Zwaan, R.A. (1997). Discourse comprehension. Annual Review of Psychology, 48, 163-189.
Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Gustafsson, J.E. and Stahl, P.A. (2000). STREAMS User's Guide, Version 2.5 for Windows. Mölndal, Sweden: MultivariateWare.
Hambleton, R., Merenda, P. and Spielberger, C.D. (in press). Adapting Educational and Psychological Tests for Cross-Cultural Assessment. Hillsdale, NJ: Lawrence Erlbaum.
Henry, G. (1975). Comment mesurer la lisibilité. Bruxelles: Nathan-Labor.
International Assessment of Educational Progress. (1992). Learning Science. Princeton, NJ: Educational Testing Service.
International Labour Organisation. (1990). International Standard Classification of Occupations: ISCO-88. Geneva: Author.
Jöreskog, K.G. and Sörbom, D. (1993). LISREL8 User's Reference Guide. Chicago: Scientific Software International.
Judkins, D.R. (1990). Fay's method for variance estimation. Journal of Official Statistics, 6(3), 223-239.
Kintsch, W. and van Dijk, T. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363-394.
Kirsch, I. and Mosenthal, P. (1990). Exploring document literacy: Variables underlying the performance of young adults. Reading Research Quarterly, 25, 5-30.
Kish, L. (1965). Survey Sampling. New York: Wiley.
Longford, N.T. (1995). Models of Uncertainty in Educational Testing. New York: Springer-Verlag.
Macaskill, G., Adams, R.J. and Wu, M.L. (1998). Scaling methodology and procedures for the mathematics and science literacy, advanced mathematics and physics scales. In: M. Martin and D.L. Kelly (eds.), Third International Mathematics and Science Study, Technical Report Volume 3: Implementation and Analysis. Chestnut Hill, MA: Center for the Study of Testing, Evaluation and Educational Policy, Boston College.
Marsh, H.W., Shavelson, R.J. and Byrne, B.M. (1992). A multidimensional, hierarchical self-concept. In: R.P. Lipka and T.M. Brinthaupt (eds.), Studying the Self: Self-perspectives Across the Life-Span. Albany: State University of New York Press.
McCormick, T.W. (1988). Theories of Reading in Dialogue: An Interdisciplinary Study. New York: University Press of America.
McCullagh, P. and Nelder, J.A. (1983). Generalized Linear Models (2nd edition). London: Chapman and Hall.
Mislevy, R.J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177-196.
Mislevy, R.J. and Sheehan, K.M. (1987). Marginal estimation procedures. In: A.E. Beaton (ed.), The NAEP 1983-84 Technical Report (Report No. 15-TR-20). Princeton, NJ: Educational Testing Service, 293-360.
Mislevy, R.J. and Sheehan, K.M. (1989). Information matrices in latent-variable models. Journal of Educational Statistics, 14(4), 335-350.
Mislevy, R.J., Beaton, A.E., Kaplan, B. and Sheehan, K.M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.
Nishisato, S. (1980). Analysis of Categorical Data: Dual Scaling and its Applications. Toronto: University of Toronto Press.
Oh, H.L. and Scheuren, F.J. (1983). Weighting adjustment for unit nonresponse. In: W.G. Madow, I. Olkin and D. Rubin (eds.), Incomplete Data in Sample Surveys, Vol. 2: Theory and Bibliographies. New York: Academic Press, 143-184.
Organisation for Economic Co-operation and Development (1999a). Measuring Student Knowledge and Skills: A New Framework for Assessment. Paris: OECD Publications.
Organisation for Economic Co-operation and Development (1999b). Classifying Educational Programmes: Manual for ISCED-97 Implementation in OECD Countries. Paris: OECD Publications.
Organisation for Economic Co-operation and Development (2000). Literacy in the Information Age. Paris: OECD Publications.
Organisation for Economic Co-operation and Development (2001). Knowledge and Skills for Life: First Results from PISA 2000. Paris: OECD Publications.
Organisation for Economic Co-operation and Development (2002a). Manual for the PISA 2000 Database. Paris: OECD Publications.
Organisation for Economic Co-operation and Development (2002b). Sample Tasks from the PISA 2000 Assessment. Paris: OECD Publications.
Owens, L. and Barnes, J. (1992). Learning Preferences Scales. Hawthorn, Vic.: Australian Council for Educational Research.
Peschar, J.L. (1997). Prepared for Life? How to Measure Cross-Curricular Competencies. Paris: OECD Publications.
Pintrich, P.R., Smith, D.A.F., Garcia, T. and McKeachie, W.J. (1993). Reliability and predictive validity of the Motivated Strategies for Learning Questionnaire (MSLQ). Educational and Psychological Measurement, 53, 801-813.
Robitaille, D.F. and Garden, R.A. (eds.) (1989). The IEA Study of Mathematics II: Contexts and Outcomes of School Mathematics. Oxford: Pergamon Press.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Rumelhart, D.E. (1985). Toward an interactive model of reading. In: H. Singer and R.B. Ruddell (eds.), Theoretical Models and the Processes of Reading (3rd edition). Newark, DE: International Reading Association.
Rust, K. (1985). Variance estimation for complex estimators in sample surveys. Journal of Official Statistics, 1(4), 381-397.
Rust, K.F. and Rao, J.N.K. (1996). Variance estimation for complex surveys using replication techniques. Statistical Methods in Medical Research, 5(3), 283-310.
Särndal, C.E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Searle, S.R. (1971). Linear Models. New York: Wiley.
Shao, J. (1996). Resampling methods in sample surveys (with discussion). Statistics, 27, 203-254.
Sirotnik, K. (1970). An analysis of variance framework for matrix sampling. Educational and Psychological Measurement, 30, 891-908.
Spache, G. (1953). A new readability formula for primary grade reading materials. Elementary School Journal, 55, 410-413.
Thorndike, R.L. (1973). Reading Comprehension Education in Fifteen Countries. Uppsala: Almqvist & Wiksell.
Travers, K.J. and Westbury, I. (eds.) (1989). The IEA Study of Mathematics I: Analysis of Mathematics Curricula. Oxford: Pergamon Press.
Van Dijk, T.A. and Kintsch, W. (1983). Strategies of Discourse Comprehension. Orlando: Academic Press.
Verhelst, N.D. (2000). Estimating Variance Components in Unbalanced Designs (R&D Notities, 2000-1). Arnhem: Cito Group.
Warm, T.A. (1985). Weighted Maximum Likelihood Estimation of Ability in Item Response Theory with Tests of Finite Length (Technical Report CGI-TR-85-08). Oklahoma City: U.S. Coast Guard Institute.
Westat (2000). WesVar 4.0 User's Guide. Rockville, MD: Author.
Wolter, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
Wu, M.L. (1997). The development and application of a fit test for use with generalised item response models. Master of Education dissertation, University of Melbourne.
Wu, M.L., Adams, R.J. and Wilson, M.R. (1997). ConQuest: Multi-Aspect Test Software [computer program]. Camberwell, Vic.: Australian Council for Educational Research.
Appendix 1: Summary of PISA Reading Literacy Items

Identification | Name | Source | Language in which Submitted | Sub-scale | Cluster | Per Cent Correct | Difficulty | Tau-1 | Tau-2 | Tau-3 | Thresholds (1/2/3)
R040Q02 | Lake Chad | ACER | English | Retr. inf. | R9 | 64.82 (0.39) | -0.221 | | | | 478
R040Q03A | Lake Chad | ACER | English | Retr. inf. | R9 | 50.39 (0.39) | 0.459 | | | | 540
R040Q03B | Lake Chad | ACER | English | Refl. & eval. | R9 | 36.58 (0.34) | 1.123 | | | | 600
R040Q04 | Lake Chad | ACER | English | Int. texts | R9 | 76.94 (0.28) | -1.112 | | | | 397
R040Q06 | Lake Chad | ACER | English | Int. texts | R9 | 56.44 (0.37) | 0.106 | | | | 508
R055Q01 | Drugged Spiders | Cito | English | Int. texts | R5 | 83.79 (0.22) | -1.377 | | | | 373
R055Q02 | Drugged Spiders | Cito | English | Refl. & eval. | R5 | 52.93 (0.27) | 0.496 | | | | 543
R055Q03 | Drugged Spiders | Cito | English | Int. texts | R5 | 60.57 (0.29) | 0.067 | | | | 504
R055Q05 | Drugged Spiders | Cito | English | Int. texts | R5 | 77.45 (0.26) | -0.877 | | | | 419
R061Q01 | Macondo | ACER | Spanish | Int. texts | R6 | 65.98 (0.29) | -0.112 | | | | 488
R061Q03 | Macondo | ACER | Spanish | Int. texts | R6 | 62.33 (0.27) | 0.005 | | | | 499
R061Q04 | Macondo | ACER | Spanish | Int. texts | R6 | 68.36 (0.26) | -0.313 | | | | 470
R061Q05 | Macondo | ACER | Spanish | Refl. & eval. | R6 | 54.25 (0.33) | 0.499 | | | | 544
R067Q01 | Aesop | Greece | Greek | Int. texts | R5 | 88.35 (0.19) | -1.726 | | | | 341
R067Q04 | Aesop | Greece | Greek | Refl. & eval. | R5 | 54.31 (0.22) | 0.516 | -0.456 | 0.456 | | 479 / 611
R067Q05 | Aesop | Greece | Greek | Refl. & eval. | R5 | 62.48 (0.29) | 0.182 | 0.482 | -0.482 | | 487 / 543
R070Q02 | Beach | Cito | English | Retr. inf. | R1 | 52.74 (0.32) | 0.485 | | | | 542
R070Q03 | Beach | Cito | English | Retr. inf. | R1 | 70.46 (0.25) | -0.487 | | | | 454
R070Q04 | Beach | Cito | English | Refl. & eval. | R1 | 17.52 (0.22) | 2.581 | | | | 733
R070Q07T | Beach | Cito | English | Int. texts | R1 | 54.35 (0.26) | 0.428 | -0.122 | 0.122 | | 488 / 586
R076Q03 | Iran Air | Cito | English | Retr. inf. | R5 | 68.58 (0.30) | -0.284 | | | | 473
R076Q04 | Iran Air | Cito | English | Int. texts | R5 | 53.88 (0.28) | 0.475 | | | | 542
R076Q05 | Iran Air | Cito | English | Retr. inf. | R5 | 52.61 (0.28) | 0.528 | | | | 546
R077Q02 | Flu | ACER | English | Retr. inf. | R9 | 70.02 (0.34) | -0.607 | | | | 443
R077Q03 | Flu | ACER | English | Refl. & eval. | R9 | 44.28 (0.34) | 0.706 | 0.798 | -0.798 | | 542 / 583
R077Q04 | Flu | ACER | English | Int. texts | R9 | 53.34 (0.32) | 0.247 | | | | 521
R077Q05 | Flu | ACER | English | Refl. & eval. | R9 | 30.73 (0.29) | 1.521 | | | | 637
R077Q06 | Flu | ACER | English | Int. texts | R9 | 44.68 (0.40) | 0.699 | | | | 562
R081Q01 | Graffiti | Finland | Finnish | Int. texts | R1 | 76.34 (0.23) | -0.852 | | | | 421
R081Q02 | Graffiti | Finland | Finnish | Int. texts | R1 | | | | | |
R081Q05 | Graffiti | Finland | Finnish | Int. texts | R1 | 52.65 (0.26) | 0.475 | | | | 542
R081Q06A | Graffiti | Finland | Finnish | Refl. & eval. | R1 | 67.14 (0.26) | -0.303 | | | | 471
R081Q06B | Graffiti | Finland | Finnish | Refl. & eval. | R1 | 44.75 (0.31) | 0.906 | | | | 581
R083Q01 | Household Work | ACER | English | Int. texts | R6 | 66.36 (0.28) | -0.163 | | | | 484
R083Q02 | Household Work | ACER | English | Retr. inf. | R6 | 85.12 (0.21) | -1.422 | | | | 369
R083Q03 | Household Work | ACER | English | Retr. inf. | R6 | 80.53 (0.23) | -1.036 | | | | 404
R083Q04 | Household Work | ACER | English | Int. texts | R6 | 68.06 (0.28) | -0.198 | | | | 480
R083Q06 | Household Work | ACER | English | Refl. & eval. | R6 | 47.17 (0.32) | 0.845 | | | | 575
R086Q04 | If | ACER | English | Refl. & eval. | R4 | 29.00 (0.24) | 1.791 | | | | 661
R086Q05 | If | ACER | English | Int. texts | R4 | 83.48 (0.22) | -1.367 | | | | 374
R086Q07 | If | ACER | English | Refl. & eval. | R4 | 72.74 (0.26) | -0.683 | | | | 436
R088Q01 | Labour | New Zealand | English | Int. texts | R9 | 62.76 (0.34) | -0.235 | | | | 477
R088Q03 | Labour | New Zealand | English | Retr. inf. | R9 | 45.97 (0.27) | 0.658 | -0.578 | 0.578 | | 485 / 631
R088Q04T | Labour | New Zealand | English | Int. texts | R9 | 39.20 (0.25) | 1.117 | -1.336 | 1.336 | | 473 / 727
R088Q05T | Labour | New Zealand | English | Refl. & eval. | R9 | 68.83 (0.32) | -0.588 | | | | 445
R088Q07 | Labour | New Zealand | English | Refl. & eval. | R9 | 61.62 (0.37) | -0.135 | | | | 486
R091Q05 | Library | ACER | English | Retr. inf. | R3 | 85.78 (0.21) | -1.484 | | | | 363
R091Q06 | Library | ACER | English | Int. texts | R3 | 86.42 (0.18) | -1.597 | | | | 353
R091Q07B | Library | ACER | English | Refl. & eval. | R3 | 79.45 (0.21) | -1.039 | | | | 404
R093Q03 | News Agencies | Denmark | Danish | Int. texts | R5 | 74.69 (0.26) | -0.563 | | | | 447
R093Q04 | News Agencies | Denmark | Danish | Int. texts | R5 | | | | | |
R099Q02 | Plan International | ACER | English | Int. texts | R7 | | | | | |
R099Q03 | Plan International | ACER | English | Int. texts | R7 | | | | | |
R099Q04B | Plan International | ACER | English | Refl. & eval. | R7 | 10.66 (0.16) | 2.916 | -0.316 | 0.316 | | 705 / 822
R100Q04 | Police | Belgium | French | Retr. inf. | R6 | 61.13 (0.29) | 0.178 | | | | 515
R100Q05 | Police | Belgium | French | Int. texts | R6 | 58.26 (0.28) | 0.217 | | | | 518
R100Q06 | Police | Belgium | French | Int. texts | R6 | 80.24 (0.24) | -1.012 | | | | 406
R100Q07 | Police | Belgium | French | Int. texts | R6 | 80.65 (0.24) | -1.065 | | | | 402
R101Q01 | Rhino | Sweden | Swedish | Int. texts | R1 | 54.92 (0.30) | 0.457 | | | | 540
R101Q02 | Rhino | Sweden | Swedish | Int. texts | R1 | 86.11 (0.17) | -1.627 | | | | 350
R101Q03 | Rhino | Sweden | Swedish | Refl. & eval. | R1 | 63.93 (0.28) | -0.082 | | | | 491
R101Q04 | Rhino | Sweden | Swedish | Int. texts | R1 | 78.32 (0.26) | -1.002 | | | | 407
R101Q05 | Rhino | Sweden | Swedish | Int. texts | R1 | 47.41 (0.30) | 0.800 | | | | 571
R101Q08 | Rhino | Sweden | Swedish | Int. texts | R1 | 55.81 (0.28) | 0.380 | | | | 533
R102Q01 | Shirts | Cito | English | Int. texts | R4 | 81.10 (0.24) | -1.306 | | | | 380
R102Q04A | Shirts | Cito | English | Int. texts | R4 | 36.00 (0.29) | 1.206 | | | | 608
R102Q05 | Shirts | Cito | English | Int. texts | R4 | 41.80 (0.30) | 0.905 | | | | 581
R102Q06 | Shirts | Cito | English | Refl. & eval. | R4 | 33.44 (0.25) | 1.481 | | | | 633
R102Q07 | Shirts | Cito | English | Int. texts | R4 | 85.23 (0.22) | -1.566 | | | | 356
R104Q01 | Telephone | New Zealand | English | Retr. inf. | R6 | 82.63 (0.23) | -1.235 | | | | 386
R104Q02 | Telephone | New Zealand | English | Retr. inf. | R6 | 41.30 (0.29) | 1.105 | | | | 599
R104Q05 | Telephone | New Zealand | English | Retr. inf. | R6 | 28.89 (0.18) | 1.875 | -0.914 | 0.914 | | 574 / 764
R104Q06 | Telephone | New Zealand | English | Retr. inf. | R6 | 79.51 (0.22) | -0.931 | | | | 414
R110Q01 | Runners | Belgium | French | Int. texts | R8 | 84.50 (0.24) | -1.561 | | | | 356
R110Q04 | Runners | Belgium | French | Retr. inf. | R8 | 78.84 (0.27) | -1.166 | | | | 392
R110Q05 | Runners | Belgium | French | Retr. inf. | R8 | 76.21 (0.24) | -1.025 | | | | 405
R110Q06 | Runners | Belgium | French | Refl. & eval. | R8 | 77.78 (0.23) | -1.059 | | | | 402
R111Q01 | Exchange | Finland | Finnish | Int. texts | R4 | 63.87 (0.28) | -0.053 | | | | 494
R111Q02B | Exchange | Finland | Finnish | Refl. & eval. | R4 | 34.14 (0.24) | 1.365 | -0.554 | 0.554 | | 551 / 694
R111Q04 | Exchange | Finland | Finnish | Retr. inf. | R4 | 77.70 (0.25) | -0.868 | | | | 419
R111Q06B | Exchange | Finland | Finnish | Refl. & eval. | R4 | 44.42 (0.29) | 0.808 | 0.828 | -0.828 | | 552 / 592
R119Q01 | Gift | USA | English | Int. texts | R3 | 73.46 (0.23) | -0.563 | | | | 447
R119Q04 | Gift | USA | English | Int. texts | R3 | 40.38 (0.30) | 1.149 | | | | 603
R119Q05 | Gift | USA | English | Refl. & eval. | R3 | 37.10 (0.25) | 1.226 | 0.030 | -0.030 | | 567 / 652
R119Q06 | Gift | USA | English | Retr. inf. | R3 | 85.29 (0.22) | -1.448 | | | | 367
R119Q07 | Gift | USA | English | Int. texts | R3 | 42.81 (0.26) | 1.031 | -0.216 | 0.216 | | 539 / 645
R119Q08 | Gift | USA | English | Int. texts | R3 | 56.64 (0.28) | 0.338 | | | | 529
R119Q09T | Gift | USA | English | Refl. & eval. | R3 | 63.98 (0.28) | 0.116 | 0.450 | -0.450 | | 480 / 537
R120Q01 | Student Opinions | ACER | English | Int. texts | R7 | 57.89 (0.25) | 0.202 | | | | 517
R120Q03 | Student Opinions | ACER | English | Int. texts | R7 | 43.04 (0.25) | 1.004 | | | | 590
R120Q06 | Student Opinions | ACER | English | Refl. & eval. | R7 | 61.33 (0.25) | 0.105 | | | | 508
R120Q07T | Student Opinions | ACER | English | Refl. & eval. | R7 | 71.82 (0.24) | -0.629 | | | | 441
R122Q01A | Just art | ACER | English | | R5 | | | | | |
R122Q01B | Just art | ACER | English | | R5 | | | | | |
R122Q01C | Just art | ACER | English | | R5 | | | | | |
R122Q02 | Just art | ACER | English | Refl. & eval. | R5 | 59.82 (0.29) | 0.138 | | | | 511
R122Q03T | Just art | ACER | English | Retr. inf. | R5 | 43.48 (0.25) | 0.894 | -0.029 | 0.029 | | 535 / 625
R216Q01 | Amanda & the Duchess | France | French | Int. texts | R9 | 73.54 (0.36) | -0.833 | | | | 423
R216Q02 | Amanda & the Duchess | France | French | Refl. & eval. | R9 | 44.22 (0.41) | 0.690 | | | | 561
R216Q03T | Amanda & the Duchess | France | French | Int. texts | R9 | 43.90 (0.35) | 0.751 | | | | 567
R216Q04 | Amanda & the Duchess | France | French | Retr. inf. | R9 | 36.61 (0.35) | 1.206 | | | | 608
R216Q06 | Amanda & the Duchess | France | French | Int. texts | R9 | 67.10 (0.32) | -0.474 | | | | 455
R219Q01E | Employment | IALS | IALS | Retr. inf. | R1 | 69.94 (0.27) | -0.550 | | | | 448
R219Q01T | Employment | IALS | IALS | Int. texts | R1 | 57.37 (0.31) | 0.278 | | | | 524
R219Q02 | Employment | IALS | IALS | Refl. & eval. | R1 | 76.24 (0.27) | -0.917 | | | | 415
R220Q01 | South Pole | France | French | Retr. inf. | R7 | 46.03 (0.31) | 0.785 | | | | 570
R220Q02B | South Pole | France | French | Int. texts | R7 | 64.49 (0.30) | -0.144 | | | | 485
R220Q04 | South Pole | France | French | Int. texts | R7 | 60.67 (0.28) | 0.163 | | | | 513
R220Q05 | South Pole | France | French | Int. texts | R7 | 84.88 (0.21) | -1.599 | | | | 353
R220Q06 | South Pole | France | French | Int. texts | R7 | 65.54 (0.27) | -0.172 | | | | 483
R225Q01 | Nuclear | IALS | IALS | Int. texts | R2 | | | | | |
R225Q02 | Nuclear | IALS | IALS | Int. texts | R2 | 70.77 (0.25) | -0.313 | | | | 470
R225Q03 | Nuclear | IALS | IALS | Retr. inf. | R2 | 89.87 (0.17) | -1.885 | | | | 327
R225Q04 | Nuclear | IALS | IALS | Retr. inf. | R2 | 75.80 (0.28) | -0.663 | | | | 438
R227Q01 | Optician | Switzerland | German | Int. texts | R4 | 57.65 (0.34) | 0.196 | | | | 516
R227Q02T | Optician | Switzerland | German | Retr. inf. | R4 | 59.58 (0.20) | 0.045 | -1.008 | 1.008 | | 401 / 604
R227Q03 | Optician | Switzerland | German | Refl. & eval. | R4 | 55.58 (0.32) | 0.295 | | | | 525
R227Q04 | Optician | Switzerland | German | Int. texts | R4 | 69.85 (0.28) | -0.216 | 0.457 | -0.457 | | 450 / 507
R227Q06 | Optician | Switzerland | German | Retr. inf. | R4 | 74.29 (0.26) | -0.916 | | | | 415
R228Q01 | Guide | Finland | Finnish | Int. texts | R7 | 67.24 (0.27) | -0.335 | | | | 468
R228Q02 | Guide | Finland | Finnish | Int. texts | R7 | 59.36 (0.23) | 0.194 | | | | 516
R228Q04 | Guide | Finland | Finnish | Int. texts | R7 | 57.03 (0.30) | 0.247 | | | | 521
R234Q01 | personnel | IALS | IALS | Retr. inf. | R2 | 85.14 (0.20) | -1.487 | | | | 363
R234Q02 | personnel | IALS | IALS | Retr. inf. | R2 | 31.76 (0.28) | 1.728 | | | | 655
R236Q01 | new rules | IALS | IALS | Int. texts | R8 | 47.88 (0.31) | 0.660 | | | | 558
R236Q02 | new rules | IALS | IALS | Int. texts | R8 | 25.52 (0.25) | 1.873 | | | | 669
R237Q01 | Hiring Interview | IALS | IALS | Retr. inf. | R8 | 59.59 (0.26) | -0.005 | | | | 498
R237Q03 | Hiring Interview | IALS | IALS | Int. texts | R8 | 57.60 (0.30) | 0.132 | | | | 510
R238Q01 | Bicycle | IALS | IALS | Retr. inf. | R2 | 65.07 (0.30) | -0.092 | | | | 490
R238Q02 | Bicycle | IALS | IALS | Int. texts | R2 | 49.46 (0.27) | 0.722 | | | | 564
R239Q01 | Allergies/Explorers | IALS | IALS | Int. texts | R8 | 52.38 (0.30) | 0.469 | | | | 541
R239Q02 | Allergies/Explorers | IALS | IALS | Retr. inf. | R8 | 66.78 (0.27) | -0.257 | | | | 475
R241Q01 | Warranty hotpoint | IALS | IALS | Retr. inf. | R2 | | | | | |
R241Q02 | Warranty hotpoint | IALS | IALS | Int. texts | R2 | 57.59 (0.30) | 0.432 | | | | 538
R245Q01 | Movie Reviews | IALS | IALS | Retr. inf. | R2 | 72.69 (0.25) | -0.556 | | | | 448
R245Q02 | Movie Reviews | IALS | IALS | Int. texts | R2 | 66.63 (0.28) | -0.043 | | | | 494
R246Q01 | Contact employer | IALS | IALS | Retr. inf. | R8 | 75.08 (0.29) | -0.805 | | | | 425
R246Q02 | Contact employer | IALS | IALS | Retr. inf. | R8 | 31.96 (0.31) | 1.560 | | | | 640
Appendix 2: Summary of PISA Mathematical Literacy Items

Identification | Name | Source | Language in which Submitted | Cluster | Per Cent Correct | Difficulty | Tau-1 | Tau-2 | Tau-3 | Thresholds (1/2/3)
M033Q01 | A View Room | Cito | Dutch | M3 | 74.03 (0.27) | -1.520 | | | | 442
M034Q01T | Bricks | Cito | Dutch | M3 | 38.72 (0.28) | 0.459 | | | | 593
M037Q01T | Farms | Cito | Dutch | M1 | 60.96 (0.42) | -0.867 | | | | 492
M037Q02T | Farms | Cito | Dutch | M1 | 55.16 (0.32) | -0.453 | | | | 524
M124Q01 | Walking | Cito | Dutch | M1 | 34.41 (0.40) | 0.536 | | | | 599
M124Q03T | Walking | Cito | Dutch | M1 | 18.89 (0.25) | 1.276 | -0.258 | 0.169 | 0.089 | 600 / 659 / 708
M136Q01T | Apples | Cito | Dutch | M2 | 49.14 (0.28) | -0.139 | | | | 548
M136Q02T | Apples | Cito | Dutch | M2 | 24.88 (0.26) | 1.265 | | | | 655
M136Q03T | Apples | Cito | Dutch | M2 | 13.19 (0.18) | 1.820 | 0.385 | -0.385 | | 672 / 723
M144Q01T | Cube Painting | USA | English | M1 | 63.86 (0.33) | -1.012 | | | | 481
M144Q02T | Cube Painting | USA | English | M1 | 26.26 (0.32) | 1.090 | | | | 641
M144Q03 | Cube Painting | USA | English | M1 | 76.82 (0.33) | -1.815 | | | | 420
M144Q04T | Cube Painting | USA | English | M1 | 37.31 (0.38) | 0.431 | | | | 591
M145Q01T | Cubes | Cito | Dutch | M3 | 58.28 (0.32) | -0.559 | | | | 516
M148Q01T | Continent Area | Sweden | Swedish | M2 | | | | | |
M148Q02T | Continent Area | Sweden | Swedish | M2 | 19.33 (0.20) | 1.474 | -0.138 | 0.138 | | 629 / 712
M150Q01 | Growing Up | Cito | Dutch | M2 | 61.57 (0.28) | -0.686 | | | | 506
M150Q02T | Growing Up | Cito | Dutch | M2 | 69.36 (0.26) | -1.129 | -0.502 | 0.502 | | 415 / 529
M150Q03T | Growing Up | Cito | Dutch | M2 | 45.79 (0.29) | 0.009 | | | | 559
M155Q01 | Population Pyramids | Cito | Dutch | M3 | 60.46 (0.34) | -0.745 | | | | 501
M155Q02T | Population Pyramids | Cito | Dutch | M3 | 59.49 (0.32) | -0.643 | 0.481 | -0.481 | | 486 / 532
M155Q03T | Population Pyramids | Cito | Dutch | M3 | 14.54 (0.22) | 1.718 | 0.229 | -0.229 | | 660 / 719
M155Q04T | Population Pyramids | Cito | Dutch | M3 | 51.33 (0.27) | -0.230 | | | | 541
M159Q01 | Racing Car | Austria | German | M4 | 67.22 (0.33) | -0.861 | | | | 492
M159Q02 | Racing Car | Austria | German | M4 | 83.71 (0.26) | -2.037 | | | | 403
M159Q03 | Racing Car | Austria | German | M4 | 82.81 (0.22) | -1.897 | | | | 413
M159Q05 | Racing Car | Austria | German | M4 | 28.34 (0.33) | 1.266 | | | | 655
M161Q01 | Triangles | USA | English | M4 | 58.48 (0.32) | -0.283 | | | | 537
M179Q01T | Robberies | TIMSS | English | M4 | 26.31 (0.24) | 1.328 | -0.360 | 0.360 | | 609 / 710
M192Q01T | Containers | Germany | German | M3 | 37.81 (0.28) | 0.477 | | | | 595
M266Q01T | Carpenter | ACER | English | M4 | 19.92 (0.25) | 1.858 | | | | 700
M273Q01T | Pipelines | Czech Rep. | Czech | M4 | 54.17 (0.31) | -0.130 | | | | 548
Appendix 3: Summary of PISA Scientific Literacy Items

Identification | Name | Source | Language in which Submitted | Cluster | Per Cent Correct | Difficulty | Tau-1 | Tau-2 | Tau-3 | Thresholds (1/2/3)
S114Q03T | Greenhouse | Cito | English | S1 | 56.79 (0.35) | -0.373 | | | | 520
S114Q04T | Greenhouse | Cito | English | S1 | 39.29 (0.30) | 0.377 | -0.050 | 0.050 | | 545 / 645
S114Q05T | Greenhouse | Cito | English | S1 | 24.60 (0.32) | 1.307 | | | | 688
S128Q01 | Cloning | Cito | French | S2 | 61.64 (0.28) | -0.557 | | | | 502
S128Q02 | Cloning | Cito | French | S2 | 45.38 (0.28) | 0.284 | | | | 586
S128Q03T | Cloning | Cito | French | S2 | 61.16 (0.27) | -0.527 | | | | 505
S129Q01 | Daylight | ACER | English | S4 | 38.75 (0.33) | 0.620 | | | | 619
S129Q02T | Daylight | ACER | English | S4 | 17.82 (0.25) | 1.497 | 0.673 | -0.673 | | 682 / 732
S131Q02T | Good Vibrations | ACER | English | S2 | 50.51 (0.28) | 0.028 | | | | 560
S131Q04T | Good Vibrations | ACER | English | S2 | 24.74 (0.24) | 1.438 | | | | 701
S133Q01 | Research | USA | English | S1 | 56.22 (0.33) | -0.356 | | | | 522
S133Q03 | Research | USA | English | S1 | 42.24 (0.30) | 0.313 | | | | 589
S133Q04T | Research | USA | English | S1 | 43.25 (0.34) | 0.250 | | | | 582
S195Q02T | Semmelweis' Diary | Cito | Dutch | S3 | 24.95 (0.26) | 0.946 | 1.273 | -1.273 | | 638 / 666
S195Q04 | Semmelweis' Diary | Cito | Dutch | S3 | 63.33 (0.27) | -0.643 | | | | 493
S195Q05T | Semmelweis' Diary | Cito | Dutch | S3 | 67.58 (0.29) | -0.900 | | | | 467
S195Q06 | Semmelweis' Diary | Cito | Dutch | S3 | 60.04 (0.27) | -0.494 | | | | 508
S209Q01T | Tidal Power | ACER | English | S3 | | | | | |
S209Q02T | Tidal Power | ACER | English | S3 | 43.31 (0.32) | 0.346 | | | | 592
S213Q01T | Clothes | Australia | English | S1 | 40.01 (0.33) | 0.419 | | | | 599
S213Q02 | Clothes | Australia | English | S1 | 75.41 (0.31) | -1.484 | | | | 409
S252Q01 | South Rainea | Korea | Korean | S3 | 48.29 (0.27) | 0.026 | | | | 560
S252Q02 | South Rainea | Korea | Korean | S3 | 71.98 (0.24) | -1.123 | | | | 445
S252Q03T | South Rainea | Korea | Korean | S3 | 54.59 (0.26) | -0.176 | | | | 540
S253Q01T | Ozone | Norway | Norwegian | S4 | 28.37 (0.30) | 0.978 | 0.609 | -0.609 | | 628 / 682
S253Q02 | Ozone | Norway | Norwegian | S4 | 34.78 (0.32) | 0.849 | | | | 642
S253Q05 | Ozone | Norway | Norwegian | S4 | 53.79 (0.32) | -0.108 | | | | 547
S256Q01 | Spoons | TIMSS | English | S1 | 88.27 (0.20) | -2.491 | | | | 308
S268Q01 | Algae | Sweden | Swedish | S4 | 73.41 (0.29) | -1.250 | | | | 432
S268Q02T | Algae | Sweden | Swedish | S4 | 39.83 (0.38) | 0.578 | | | | 615
S268Q06 | Algae | Sweden | Swedish | S4 | 57.39 (0.41) | -0.236 | | | | 534
S269Q01 | Earth's Temperature | Cito | Dutch | S2 | 59.28 (0.29) | -0.460 | | | | 511
S269Q03T | Earth's Temperature | Cito | Dutch | S2 | 41.36 (0.31) | 0.497 | | | | 607
S269Q04T | Earth's Temperature | Cito | Dutch | S2 | 35.30 (0.31) | 0.712 | | | | 629
S270Q03T | Ozone | Belgium | French | S4 | 56.42 (0.34) | -0.287 | | | | 529
As part of the field trial, informal validity studieswere conducted in Canada, the Czech Republic,France and the United Kingdom. These studiesaddressed the question of how well 15-year-oldscould be expected to report the occupations oftheir parents.
Canada
The Canadian study found the followingcorrelation results:• between achievement and mother’s occupation
reported by student: 0.24;• between achievement and mother’s occupation
reported by mother in 80 per cent of casesand father in 20 per cent of cases: 0.20;
• between achievement and father’s occupationreported by student: 0.21; and
• between achievement and father’s occupationreported by father in 20 per cent of cases andmother in 80 per cent of cases: 0.29. (It should be noted that 80 per cent of theresponses here are proxy data.)
Czech Republic
The correlations between student- and parent-reported parental occupational status (SEI) in the study in the Czech Republic were:
• for mothers 0.87; and
• for fathers 0.85.
France
The extent of agreement between the students’ and the parents’ reports of the parents’ occupations in France, grouped by ISCO Major Group, is shown in Table 88. This table could call the accuracy of students’ reports of their parents’ occupations into question. However, three important factors should be noted. First, the table includes non-response, which necessarily decreases the level of agreement. Non-response by parents ranged from 2 per cent to over 10 per cent for the Manager group. Second, the responses not in agreement were clustered around the main diagonal. So, for example,
Table 88: Validity Study Results for France

                                                   Mother                     Father
ISCO Major Group                          % Agreement    Number     % Agreement    Number
                                          with Parent                with Parent
1000 Legislators, Senior Officials, Managers 39 26 44 87
2000 Professionals 74 88 69 87
3000 Technicians, Associate Professionals 62 125 50 126
4000 Clerks 69 137 43 35
5000 Service/Sales 63 110 62 38
6000 Agricultural/Primary Industry 75 8 85 39
7000 Craft and Trades 64 17 76 130
8000 Operators, Drivers 67 22 58 66
9000 Elementary Occupations 57 57 43 37
Total 711 711
of students who responded that their father was in the Craft and Trades group, 76 per cent of the fathers put their occupation in the same category and 11 per cent were in the next category (Operators). Of students who indicated that their mothers were professionals, their answers were confirmed by their mothers in 74 per cent of cases and another 19 per cent were self-reported para-professionals. Although students who indicated that their father’s occupation was in the Manager group had this confirmed by their fathers in only 44 per cent of cases, another 30 per cent of the fathers were self-reported professionals and para-professionals. Third, there were very few irreconcilable disagreements. For example, no father whose child classified him as a Manager self-reported himself in major groups 6 to 9. Similarly, of fathers reported as professionals by their children, very few (0.5 per cent) indicated that they worked as operatives or unskilled labourers.
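The near-diagonal pattern described above can be checked mechanically from a cross-tabulation of student and parent reports. A minimal sketch follows; the ISCO major-group pairs below are invented for illustration, not the French microdata.

```python
# Sketch: per-group agreement and near-diagonal share from (student, parent)
# ISCO major-group pairs. All data values here are invented.
from collections import Counter

pairs = [  # (student report, parent report), ISCO major groups 1-9
    (7, 7), (7, 7), (7, 8), (2, 2), (2, 3), (1, 1), (1, 2), (4, 4),
    (5, 5), (5, 5), (3, 3), (3, 2), (9, 9), (6, 6), (8, 8), (2, 2),
]

def agreement_by_group(pairs):
    """For each student-reported group, the share of parents reporting the same group."""
    totals, exact = Counter(), Counter()
    for s, p in pairs:
        totals[s] += 1
        exact[s] += (s == p)
    return {g: exact[g] / totals[g] for g in sorted(totals)}

# Share of report pairs lying on or next to the main diagonal of the cross-tab
near_diag = sum(abs(s - p) <= 1 for s, p in pairs) / len(pairs)

print(agreement_by_group(pairs))
print(near_diag)
```

With real data, a high `near_diag` value alongside modest exact agreement would reproduce the French finding that most disagreements are between adjacent major groups.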
United Kingdom
In the study carried out in the United Kingdom, SEI scores of parents’ occupations were developed from 3-digit ISCO codes.
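The code-to-score step can be sketched as a simple lookup table from 3-digit ISCO codes to SEI values. The codes and scores below are invented placeholders, not the mapping actually used; studies of this kind typically rely on an established scale such as the ISEI of Ganzeboom et al.

```python
# Hypothetical sketch of deriving SEI scores from 3-digit ISCO codes.
# The code-to-score values below are invented for illustration only.
isco_to_sei = {"213": 71, "341": 55, "512": 30, "723": 34, "931": 20}

def sei(isco3):
    """Return the SEI score for a 3-digit ISCO code, or None if unmapped."""
    return isco_to_sei.get(isco3)

print(sei("213"), sei("999"))
```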
The correlations based on the SEI scores of occupations reported by students and parents were:
• for mothers 0.76; and
• for fathers 0.71.
The extent of agreement between the students’ and the parents’ reports of the parents’ occupations, grouped by ISCO Major Group, is shown in Table 89 for the United Kingdom study.
Correlations based on the SEI scores developed from the three-digit ISCO codes were:
• between achievement and mother’s occupation reported by student: 0.32;
• between achievement and mother’s self-reported occupation: 0.25;
• between achievement and father’s occupation reported by student: 0.36; and
• between achievement and father’s self-reported occupation: 0.38.
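The agreement figures quoted in these studies are product-moment correlations. A minimal self-contained sketch of the computation, on invented SEI pairs:

```python
# Sketch: Pearson product-moment correlation between two reports of SEI.
# The SEI values below are invented for illustration.

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Invented pairs: (student-reported SEI, parent-reported SEI)
student_sei = [51, 34, 67, 43, 29, 71, 55, 38]
parent_sei = [49, 30, 69, 45, 33, 70, 58, 35]

r = pearson(student_sei, parent_sei)
print(round(r, 2))
```

The same routine applied to achievement scores against either report gives the achievement correlations reported above.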
Table 89: Validity Study Results for the United Kingdom
                                                   Mother                     Father
ISCO Major Group                          % Agreement    Number     % Agreement    Number
                                          with Parent                with Parent
1000 Legislators, Senior Officials, Managers 53 34 55 71
2000 Professionals 94 69 82 44
3000 Technicians, Associate Professionals 40 25 60 30
4000 Clerks 64 74 50 12
5000 Service/Sales 73 67 69 13
6000 Agricultural/Primary Industry 0 1 0 1
7000 Craft and Trades 20 5 68 41
8000 Operators, Drivers 83 6 71 21
9000 Elementary Occupations 60 20 56 9
Total 309 246
Appendix 5: The Main Findings from Quality Monitoring of the PISA 2000 Main Study

Not all pieces of information gathered by School Quality Monitors (SQMs) or National Centre Quality Monitors (NCQMs) are used in this report. Items considered relevant to data quality and to conducting the testing sessions are included here in the section on SQM reports. A more extensive range of information is included in the section on NCQM reports.
School Quality Monitor Reports
A total of 103 SQMs submitted 893 reports that were processed by the Consortium. These came from 31 national centres (including the two linguistic regions of Belgium, which are counted as two national centres in these analyses). The form of the reports from one country was not amenable to data entry and they were therefore not included in the quality monitoring analyses.
Administration of Tests and Questionnaires in Schools
SQMs were asked to record the time at key points before, during and after the administration of the tests and questionnaires. The times, measured in minutes, were transformed into duration to the nearest minute for:
• Cognitive Blocks One and Two: Time taken for the first and second parts of the assessment booklet (referred to in these analyses as ‘cognitive blocks’) considered separately (should each equal 60 minutes if PISA procedures were completely followed);
• Questionnaire: Time taken for the Student Questionnaire;
• Student working time: Time taken for the first and second cognitive blocks plus the questionnaire (that is, total time spent by students in working on PISA materials). It was expected that this would be around 150 minutes, excluding breaks between sessions and time spent by the Test Administrator giving instructions for conducting the session; and
• Total duration: Time from the entrance of the first student to the exit of the last student, which provides a measure of how long PISA took out of a student’s school day.
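The four duration measures above can be sketched as follows, assuming (hypothetically) that SQMs recorded clock times as HH:MM strings at each key point; the times and field names below are invented.

```python
# Sketch: deriving the four PISA duration measures from recorded clock times.
# Record layout and times are invented for illustration.
from datetime import datetime

def minutes_between(start, end):
    """Whole minutes between two HH:MM clock times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

record = {  # one school's (invented) key times
    "block1_start": "09:00", "block1_end": "10:00",
    "block2_start": "10:15", "block2_end": "11:15",
    "quest_start": "11:20", "quest_end": "11:55",
    "first_student_in": "08:50", "last_student_out": "12:00",
}

block1 = minutes_between(record["block1_start"], record["block1_end"])  # 60 expected
block2 = minutes_between(record["block2_start"], record["block2_end"])  # 60 expected
questionnaire = minutes_between(record["quest_start"], record["quest_end"])
working_time = block1 + block2 + questionnaire  # around 150 minutes expected
total_duration = minutes_between(record["first_student_in"], record["last_student_out"])

print(block1, block2, questionnaire, working_time, total_duration)
```

Note that `working_time` deliberately excludes the between-session breaks, whereas `total_duration` includes them, matching the definitions above.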
Duration of cognitive blocks
The mean duration of each of the first and second cognitive blocks was 60 minutes. Totals of 97 and 93 per cent of schools, respectively, completed the first and second blocks in 60±2 minutes. For each block, only 1 per cent of schools took more than 62 minutes, three of these using 70 minutes for the first block and two using more than 70 minutes for the second block (the longest time was 79 minutes). Just over 2 per cent of schools for the first block and 6 per cent for the second block took less than 58 minutes.
There was a group of 20 schools from one country that took only 50 minutes to complete the second block, which points to a systematic error in administration procedures. Follow-up of other schools that finished early revealed that the students had either lost interest or had given up because they found the test too difficult. Some commented that the students gave up because they were unfamiliar with the format of the complex multiple-choice items.
In an attempt to explain the times in excess of 62 minutes, cross-tabulations were constructed of the time variable with the ‘general disruption’ variable (Q28), and with the ‘detection of defective booklets’ variable (Q34). None of the schools where the time exceeded 62 minutes was reported to have had these problems.
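A cross-tabulation of this kind can be sketched as below; the school records and field names are invented for illustration, not the SQM data.

```python
# Sketch: 2x2 cross-tabulation of over-time sessions (> 62 minutes for the
# first cognitive block) against the 'general disruption' flag (Q28).
# All records below are invented.
schools = [
    {"block1_minutes": 60, "disruption": False},
    {"block1_minutes": 61, "disruption": True},
    {"block1_minutes": 70, "disruption": False},
    {"block1_minutes": 79, "disruption": False},
    {"block1_minutes": 59, "disruption": True},
]

# Cell counts keyed by (over 62 minutes?, disruption reported?)
table = {(over, dis): 0 for over in (False, True) for dis in (False, True)}
for s in schools:
    table[(s["block1_minutes"] > 62, s["disruption"])] += 1

overtime_with_disruption = table[(True, True)]
print(table)
print(overtime_with_disruption)
```

An empty (True, True) cell, as in this invented example, is the pattern the report describes: none of the over-time schools had a reported disruption.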
Duration of Student Questionnaire session
The mean duration of the Student Questionnaire session was 36 minutes. A total of 39 per cent of schools took 30 minutes to complete the questionnaire. By 40 minutes, 77 per cent of schools visited by SQMs had completed the questionnaire. The minimum time taken to complete the session was 17 minutes (at one school), and the maximum was 95 minutes (also at one school). Figure 82 shows the statistics and histogram summarising these data.
No school was reported as having the questionnaire session and the cognitive block sessions on different days.
Duration of student working time
This section is concerned with the time students spent working on PISA materials, excluding breaks between sessions and time spent by the Test
Administrator giving instructions for the session. It was expected that this time would be around 150 minutes. About 15 per cent of schools took 150 minutes, and about 24 per cent took less than 150 minutes. Thus, the majority of schools (about 60 per cent) took longer than was expected, although only a third took longer than 155 minutes and the percentage drops rapidly beyond 160 minutes. Most of the additional time was taken up in completing the Student Questionnaire. The inclusion of various national and international options with the questionnaire makes it difficult to establish the reasons for the long duration of the working period for many schools. Figure 83 shows the statistics and histogram summarising these data.
Count  Midpoint
    1      16
   20      20
   50      24
   85      28
  237      32
  150      36
  109      40
   58      44
   26      48
   32      52
   13      56
   29      60
   14      64
    4      68
    2      72
    0      76
    0      80
    0      84
    0      88
    1      92
    1      96

Mean 36.294    Std err .354      Median 35.000
Mode 30.000    Std dev 10.222    Variance 104.482

Valid cases 832    Missing cases 61
Figure 82: Duration of Questionnaire Session, Showing Histogram and Associated Summary Statistics
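Text histograms in the style of Figures 82 and 83 (counts binned on fixed midpoints, one symbol per roughly eight occurrences) can be produced with a short routine of this kind; the binning parameters and durations below are invented, not the SQM data.

```python
# Sketch: a fixed-midpoint text histogram, one '*' per ~8 occurrences,
# in the style of Figures 82 and 83. Input durations are invented.

def text_histogram(values, start, width, nbins, per_star=8):
    """Return histogram lines: count, bin midpoint, and a bar of '*' symbols."""
    lines = []
    for i in range(nbins):
        mid = start + i * width
        lo, hi = mid - width / 2, mid + width / 2
        count = sum(lo <= v < hi for v in values)
        lines.append(f"{count:5d} {mid:4d} |{'*' * round(count / per_star)}")
    return "\n".join(lines)

durations = [30] * 20 + [35] * 40 + [45] * 10 + [60] * 3  # invented minutes
print(text_histogram(durations, start=28, width=8, nbins=5))
```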
Total duration of PISA in the school
The mean duration from the entrance to the test room to the exit of the last student was 205 minutes (about three and one-half hours), with a median of 200 minutes and mode of 195 minutes. There were noteworthy extremes. At one extreme, the entire PISA exercise took 75 minutes, in schools using the SE booklet, which were only required to complete one hour of testing. At the other extreme, some schools took more than 300 minutes (over five hours). These long periods were due to extended breaks between the two parts of the test, or between the testing and questionnaire sessions (in some cases as long as 90 minutes), or both. The very long breaks were mostly taken in only one country.
Preparation for the Assessment
Conditions for the test and questionnaire sessions
Generally, the data from the SQM reports suggest that the preparations for the PISA sessions went well and test room conditions were adequate. In most of the 7 per cent of cases where SQMs commented that conditions were less than adequate, the reason was that the testing room was too small.
Introducing the study
A script for introducing PISA to the students was provided in the Test Administrator’s Manual. While a small amount of latitude was permitted to suit local circumstances, e.g., in the way
Count  Midpoint
    0     101
    1     107
    1     113
    2     119
    0     125
    3     131
   15     137
   83     143
  235     149
  195     155
  126     161
   59     167
   39     173
   35     179
   15     185
    4     191
    0     197
    0     203
    0     209
    2     215
    0     221

Mean 155.598    Std err .389      Median 154.000
Mode 150.000    Std dev 11.111    Variance 123.445

Valid cases 815    Missing cases 78
Figure 83: Duration of Total Student Working Time, with Histogram and Associated Summary Statistics
encouragement was given to students concerning the importance of participating in PISA, TAs were reminded to follow the script as closely as feasible. From the SQM reports, it appears that TAs took this advice seriously. Almost 80 per cent of those visited by SQMs followed the script exactly, with about 18 per cent making minor adjustments and only 2 per cent reported as paraphrasing the script in a major way. While the number of major adjustments is small (about 20 cases), it is enough to suggest that insufficient emphasis may have been given to this aspect in the TA training.
Beginning the Testing Sessions
Of more importance is the accuracy with which the TAs read the instructions for the first and second parts of the testing session. Question 12 in the box below pertained to the first part of the session.
Almost all TAs followed the script of the Cognitive Blocks section of the manual for giving the directions, the examples and the answers to the examples, prior to beginning the actual test. The number of major additions and deletions would be a concern if they pertained to these aspects, but, given that they mostly related to whether students should stay in the testing room after finishing and were spread through about half the countries, it is unlikely these departures from the script would have had much effect on achievement, except possibly in one country where no script was provided to TAs.
Corresponding statistics for the second part of the testing session showed that a higher percentage (83 per cent) of TAs followed the instructions verbatim, with smaller numbers of adjustments to the script. Numbers of major adjustments were about the same as for the first part of the session. Again, adjustments to the script usually involved paraphrasing, but this occurred often enough that it seems that not enough emphasis was put during TA training on the need for adhering accurately to the script.
Conducting the Testing Sessions
In assessing how TAs recorded information, monitored the students, and ended the first and second cognitive block sessions, the SQMs overwhelmingly judged that the sessions were conducted according to PISA procedures. Information on timing and attendance was not recorded in only five cases. About 5 per cent of TAs were reported as remaining at their desks during the session rather than walking around to monitor students as they worked, but some commented that the students were so well behaved that there was no need to walk around. In all but six instances in the first session and 14 in the second, students stopped work promptly when the TA said it was time to stop.
At the end of the first and second parts of the testing session, close to 85 per cent of the TAs
Q 12 Did the Test Administrator read the directions, the examples and the answer to the examples from the Script of the Cognitive Blocks exactly as it is written?
Yes 705 (79.8%)
No 178 (20.2%) If no, did the Test Administrator make:
Minor additions 136 (15.2%)
Major additions 23 (2.6%)
Minor deletions 81 (9.1%)
Major deletions 11 (1.2%)
Missing 10 (1.1%)
If major additions or deletions were made, please describe:
No explanations were given by the SQMs for the major deletions, but it is likely that they arose when Test Administrators paraphrased instructions. Nearly all of the major deletions occurred because the Test Administrator said that students could leave when they had finished, when the instructions said they should remain in the room. This occurred in 17 different countries.
were reported as having followed the script exactly as it was written. Other cases involved some paraphrasing of the script (e.g., reminding students to complete the Calculator Use question at the end of their test booklet), and all but a very small number were considered to be minor. Of serious concern, however, is the country where no script was provided to TAs.
The Student Questionnaire Session
Although TAs and their assistants were encouraged to assist students in answering questions in the questionnaire, about the same percentage as for the cognitive sessions read the script exactly as it was written and seemed to perceive the session as rule-bound. In about 100 cases the TAs paraphrased the script, but in a further 20 cases the TAs substituted their own instructions entirely. In some of these instances (both types), the important instructions that students could ask for help if they had difficulties and that their answers would not be read by school staff were missed. These omissions represent a serious deviation from PISA procedures and are thus a concern in relation to the accuracy of students’ responses.
Most TAs realised that time constraints for completing the questionnaire did not need to be rigorously imposed. Because of this flexibility, the timing procedures were less closely monitored than they were for the cognitive blocks.
General Questions Concerning the Assessment
This section of the appendix describes general issues concerning how PISA sessions were conducted, such as:
• frequency of interruptions;
• orderliness of the students;
• evidence of cheating;
• how appropriately the TA answered questions;
• whether any defective booklets were detected and whether their replacement was handled appropriately;
• whether any late students were admitted; and
• whether students refused to participate.
The SQMs’ responses to some of these are shown in the boxes below and some comments are made following the boxes.
Q 28 Were there any general disruptions to the session that lasted for more than one minute (e.g., alarms, announcements, changing of classes, etc.)?
Yes 99 (11.1%) If yes, specify.
Most disruptions occurred because of noise, either from bells ringing, music lessons, the playground, or students changing classes.
No 792 (88.9%)
Missing 2 (0.2%)
Q 29 Generally, were the students orderly and cooperative?
Yes 836 (94.6%)
No 48 (5.4%)
Missing 9 (1.0%)
These data suggest that very high proportions of students who participated in the PISA tests and questionnaires did so in a positive spirit. Students refused to participate after the session had started in only a small percentage of sessions. While cheating was reported to have occurred in 42 of the observed sessions, it is difficult to imagine why students would try to cheat given that they knew that they had different test booklets. This item was included in the SQM observation sheet because there were relatively high levels of cheating reported in the field trial. It is unlikely that the cheating offences reported by the SQMs would have had much influence on country results and rankings, though they possibly affected the results of a few individual students.
More than 95 per cent of TAs were reported as responding to students’ questions appropriately during the test and questionnaire administration.
In 28 cases, TAs were observed to admit late students after the testing had begun, which was clearly stated in their manual as unacceptable. Though low in incidence, this represented a violation of PISA procedures, and the rule may not have received sufficient emphasis in TA training.
Defective test booklets were detected in close to 100 (about 11 per cent) of the SQM visits. These were replaced appropriately in half the cases, but in the other half the correct procedure was not followed. In five cases, where there were 10 or more defective booklets, the TA may not have had enough spare copies to distribute according to the specified procedure, though subsequent investigation showed that in two of these cases the booklets were replaced correctly. In the majority of cases, only one or two booklets were defective. (Items that were missing from test booklets or questionnaires were coded as not administered and therefore did not affect results.)
Summary
The data from the SQMs support the view that, in nearly all of the schools that were visited, PISA test and questionnaire sessions were conducted in an orderly fashion and conformed to PISA procedures. The students also seem to have participated with good grace, judging by the low level of disruption and refusal. This gives confidence that the PISA test data were collected under near-optimal conditions.
Interview with the PISA School Co-ordinator
School Quality Monitors conducted an interview with the PISA School Co-ordinators to try to
Q 36 Did any students refuse to participate in the assessment after the session had begun?
Yes 48 (5.4%)
No 837 (94.6%)
Missing 8 (0.9%)
Q 32 Was there any evidence of students cheating during the assessment session?
Yes 42 (4.7%)
If yes, describe this evidence.
Most of the evidence to do with cheating concerned students exchanging booklets, asking each other questions and showing each other their answers. Some students checked with their neighbour to see if they had the same test booklets. Most incidents reported by School Quality Monitors seem to have been relatively minor infringements.
No 847 (95.3%)
Missing 4 (0.4%)
establish problems experienced by schools in conducting PISA, including:
• How well the PISA activities went at the school;
• How many schools refused to have a make-up session;
• If there were any problems co-ordinating the schedule for the assessment;
• If there were any problems preparing an accurate list of age-eligible students;
• If there were any problems in making exclusions of sampled students;
• If the students received practice on PISA-like questions before the PISA assessment; and
• How many students refused to participate in the PISA tests prior to testing.
SQMs were advised that they should treat the interview as a fairly informal ‘conversation’, since it was judged that this would be less likely to be seen as an imposition on teachers. The quality of the data from these interviews was good, judging by their completeness and detail.
Responses from the interviews are summarised in Table 90.
The PISA activities appeared to have gone well in the vast majority of schools visited by the SQMs. The problems that did exist related to clashes with examinations or the curriculum, or lack of time to organise the testing. Given the demands of the sampling procedures on many schools, the relatively long time during which students were needed, and the possibility that in some countries the Student Questionnaire could be seen as controversial, this result is a very positive endorsement of the procedures put in place by National Centres.
Only 10 of the almost 900 schools that SQMs visited refused to conduct a make-up session if one was required. Generally, the schools visited had no difficulties co-ordinating the schedule for the assessment. This may suggest that the role of the School Co-ordinator was important in helping to place PISA into the school, and that it successfully achieved its objectives.
It was expected that preparing the list of age-eligible students might be a problem for schools, particularly those without computer-based records. However, only a small minority of the schools visited by the SQMs said that they had experienced problems.
It had also been anticipated that excluding sampled students might be a problem for schools. The evidence here shows that only a small minority of the schools visited by the SQMs had problems. This finding is consistent with the NCQM reports, where only five NPMs anticipated that schools would have problems. (However, it is also possible that exclusions were seen not to be problematic because they were not taken seriously. Caution is therefore advised in interpreting these results.)
Very few schools had used questions similar to those found in PISA for practice before assessment day. This is a positive finding since it implies that rehearsal effects were probably minimal.
Student refusals prior to the testing day were widely distributed among schools, with one notable exception, as shown in Table 91. Further investigation showed that this school was in a country where a national option meant that more than 35 students could be sampled. All other schools from this country had low levels of refusal.
Table 90: Responses to the School Co-ordinator Interview
Interview question Yes (%) No (%) Missing (%)
In your opinion, did the PISA activities go well at your school? 91 5 4
Were there any problems co-ordinating the schedule for the assessment? 11 85 4

Did you have any problems preparing an accurate list of age-eligible students? 7 90 3

Did you have any problems making exclusions after you received the list of sampled students? 5 90 5

Did the students get any practice on questions like those in PISA before their assessment day? 4 91 5
Did any students refuse to take the test prior to the testing? 22 76 2
Staffing
Test Administrators
Three-quarters of the NPMs thought that it was easy to find TAs as described by the PISA guidelines. Of the remaining countries, two stated that they would have problems because of a general lack of qualified people interested in the job. Two countries reported that budget concerns made it difficult to find appropriate people. Five countries noted that it was necessary to use school staff, and that there was, therefore, a chance that the Test Administrator might be a teacher of some of the students. However, this was not expected to happen often. This is backed up by the responses to Question 5, where no country expected to have to use teachers of sampled students routinely.
The National Centre or NPM in 25 countries directly or indirectly trained the TAs. Two countries contracted this training out. Sometimes the National Centre would only train regional staff, who would then train the TAs working locally.
Table 91: Numbers of Students per School Refusing to Participate in PISA Prior to Testing
Number of Students Per School    Frequency
 1                                  88
 2                                  47
 3                                  18
 4                                  14
 5                                   7
 6                                   7
 7                                   4
 9                                   4
10                                   2
11                                   1
12                                   1
44                                   1
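Table 91 is a simple frequency table, and its construction can be sketched as follows; the per-school refusal counts below are invented, chosen only to mimic the shape of the reported distribution (many small counts plus one outlier).

```python
# Sketch: frequency table of pre-test refusals per school, Table 91 style.
# The per-school counts are invented for illustration.
from collections import Counter

refusals_per_school = [1] * 8 + [2] * 4 + [3] * 2 + [5, 44]
freq = Counter(refusals_per_school)

# One row per distinct refusal count: (number of students, number of schools)
for n_refusals in sorted(freq):
    print(n_refusals, freq[n_refusals])
```

The single school with an extreme count stands out immediately in such a table, which is how the national-option outlier described above was spotted.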
National Centre Quality Monitor Reports
NCQMs interviewed 29 NPMs. This report does not include Australia, Japan or the United States, where the Consortium was responsible in whole or in part for PISA activities. Seven NCQMs collected the data. Most interviews lasted about two hours.
The International Project Centre has established the following criteria for Test Administrators:
It is required that the Test Administrator NOT be the reading, mathematics, or science instructor of any students in the sessions he or she will administer for PISA;

It is recommended that the Test Administrator NOT be a member of the staff of any school where he or she will administer PISA; and

It is preferable that the Test Administrator NOT be a member of the staff of any school in the PISA sample.
Q 4 How easy was it to comply with these criteria? If NOT easy, why not?

Easy 20
Not Easy 8
No Response/NA 1
Q 5 Who will be Test Administrators for the main study?

NPM/National Centre staff 2
Regional/District 4
External contractor 7
Teacher of students 0
Other school staff 9
Other 7
School Quality Monitors
Several countries found it difficult to recruit SQMs, primarily because the NPM believed that the Consortium had not provided an adequate job description. Three countries noted that it was difficult to find interested people who spoke either English or French, and a further three countries noted that it was hard to find qualified people who were interested and available, or who were independent of the national centre.
Markers
Countries planned to use several sources for recruiting markers. Most planned to use teachers or retired teachers who were familiar with 15-year-olds and the subject matter taught to them. Thirteen countries intended to use university students (mainly graduate-level in education). Five countries planned to use national centre staff. A few countries planned on using scientists or research staff as markers.
Administrative Issues
Obtaining permission from schools
Thirteen NPMs indicated that it had been difficult to obtain schools’ participation in PISA in their country. Four said that money or in-kind contributions to the school were used to encourage participation. In one country speeches were made to teacher and parent groups, and in another a national advertisement to promote PISA was used. Several countries provided brochures and other information to schools, and at least one country used a public website. Letters of support from Ministries of Education were noted to be useful in some instances. Telephone calls and personal meetings with school and regional staff were also common.
Unannounced visits by School Quality Monitors

PISA procedures for School Quality Monitors’ visits to schools required that the visits be unannounced. This was expected to pose problems in several countries, but in practice it was a problem in only three, where unannounced visits to schools are not permitted.
Security and confidentiality
All countries took the issue of confidentiality seriously. Nearly all required a pledge of confidentiality and provided instructions on maintaining confidentiality and security. Of the
Q 14 How easy was it to find School Quality Monitors?

Easy 15
Not Easy 13

If NOT easy, why not?
Need better job description 6
Lack English/French 3
Qualified – no interest/time 2
Qualified – not independent of national centre 1
The markers of PISA tests:
• must have a good understanding of mid-secondary level mathematics and science OR <test language>; and
• understand secondary level students and the ways that students at this level express themselves.
Q 49 Do you anticipate or did you have any problems recruiting the markers who will meet these criteria? If yes, what are these problems?

Yes 8   Seven of these ‘yes’ responses were due to a ‘lack of qualified/available staff’. Two countries cited cost or other factors as reasons for having difficulties recruiting markers.
No 19
No response/NA 2
Q 50 Who will be the markers (that is, from what pool will the markers be recruited)?

                                    First source   Second   Third
National Centre staff                     0           5       0
Teachers/Professional Educators          25           0       2
University students                       3           1       1
Scientists/researchers                    1           1       2
Other                                     0           1       1
No response/NA                            0           0       0
29 respondents, 16 intended to send school packages of materials direct to TAs, nine were planning to send them to the relevant school, and four had arranged to send them to regional ministry or school inspectors’ premises for collection by the TA.
Most countries said that the materials could not be inventoried without breaking the seal, either because they were wrapped in paper or, more commonly, were sealed in boxes with instructions not to open them until testing. These countries had not followed the suggestion in the NPM Manual that packages be heat-sealed in clear plastic, allowing booklets to be counted without the seal being broken.
Student Tracking Forms
Student Tracking Forms are among the most important administrative tools in PISA. It was important to know if NPMs, TAs and SCs found them amenable to use. Replies pointed to a need to simplify the coding scheme for documenting student exclusions and to give thought to translation issues.
Translation and Verification
The quality of data from a study like PISA depends considerably on the quality of the instruments, which in turn depend largely on the quality of translation. It was very important to the project to know that National Centres were able to follow the translation guidelines and procedures appropriately. By and large the translation quality objectives were met, as shown by the NPMs’ responses to the following questions.
Eight countries listed time pressure as the main problem with international verification. Two
Q 12 What are you doing to keep the materials secure?

At the National Centre: All countries reported good security, including confidentiality pledges, locked rooms/secured buildings, and instructions about confidentiality and security. All of the National Centres were used to dealing with embargoed and secure material.

While in the possession of the Test Administrator: Most countries used staff who were familiar with test security and confidentiality. This issue was also emphasised at Test Administrator training.

While in the possession of the School Co-ordinator: In these cases, the material was sealed and the School Co-ordinator given instructions on handling the material. Materials were not to be opened until the morning of testing.
Q 17 Does the Student Tracking Form provided by the Consortium meet the needs in your country?

Yes 22
No 7

If no, how did you modify the form to better meet your needs?
Three countries indicated that they had problems entering the exclusions. Other problems mentioned were: problems translating the text into the country’s language, problems adding a tenth booklet and problems adding additional columns. One country thought the process was too complicated in general.
Q 28 Were the guidelines (for translation and verification) clear?
Yes 24 No 1 No response/NA 4
Q 30 How much did the international verification add to the workload in the main study? If a lot, explain.
Not much 17 A lot 10 No response/NA 2
Q 33 Did the remarks of the verifier significantly improve the national version for the field trial? If no, what were the problems?

Yes 14
No 11
No response/NA 4
commented on the cost of verification procedures. The 11 NPMs who said that there was no significant improvement in the national version for the field trial also noted that they had experienced few initial problems. No country felt that the verifiers had made errors, but at least one mentioned that they did not always agree with or accept the verifier’s suggestions.
Slightly fewer countries indicated that therewere significant improvements in their nationalversions for the main study. Again, the reason
expressed by ten of the countries whoresponded that significant improvements werenot made was that not many changes weremade in any case. This is what would be hopedafter the experiences of the field trial. However,a small number of countries did indicate someproblems.
Adequacy of the Manuals
One aim of the NCQM visits was to collect suggestions from NPMs about areas in which the various PISA manuals could be improved. Seventeen of the NPMs offered comments, which are tabulated in the box below.

The format of some of the manuals clearly needs to be simplified. A problem for many NPMs was keeping track of and following revisions, especially while translations were in progress.
Q 34 Did the remarks of the verifier significantly improve the national version for the main study? If no, what were the problems?

Yes 12 No 13 No response/NA 4
Q 55 Do you have any comments about manuals and other materials at this time?

NPM Manual (comments offered = 17; no comments offered = 12):
Thought manual good/adequate = 7; Concerned about the lateness of receiving manual = 2; Manual needs to be simplified or made clearer = 8

SC Manual (comments offered = 17; no comments offered = 12):
Thought manual good/adequate = 5; Manual needs to be simplified or made clearer = 11; Manual needs checklist or more examples (especially non-English/French) = 1

TA Manual (comments offered = 16; no comments offered = 13):
Thought manual good/adequate = 4; Manual needs to be simplified or made clearer = 11; Manual needs checklist or more examples (especially non-English/French) = 1

Marking Guide (comments offered = 16; no comments offered = 13):
Thought Guide good/adequate = 11; Guide needs to be simplified or made clearer = 2; Guide needs checklist or more examples (especially non-English/French) = 1; Other comments = 2

Sampling Manual (comments offered = 16; no comments offered = 13):
Thought manual good/adequate = 12; Manual needs to be simplified or made clearer = 4

Translation Guidelines (comments offered = 11; no comments offered = 18):
Thought Guidelines good/adequate = 10; Guidelines need to be simplified or made clearer = 1

Other (comments offered = 6; no comments offered = 23):
Thought good/adequate = 2; Manual needs to be simplified or made clearer = 4

(Other comments referred to the KeyQuest Manual and were noted earlier in the report in the section on the Student Tracking Form and sampling.)
Conclusion

In general, the NCQM reports point to a strong organisational base for the conduct of PISA. National Centres were, in the main, well organised and had a good understanding of the following aspects:
• appointing and training Test Administrators;
• appointing SQMs;
• organising for marking and appointing markers;
• security of PISA materials;
• the Student Tracking Form and sampling issues; and
• translation (where appropriate) and verification of materials, although concerns were expressed about how much work was involved and some disagreement with decisions made during this process was indicated.
Appendix 6: Order Effects Figures
Table 92: Item Assignment to Reading Clusters
Cluster
No. 1 2 3 4 5 6 7 8 9
1 R219Q01T R245Q01 R091Q05 R227Q01 R093Q03 R061Q01 R228Q01 R110Q01 R077Q02
2 R219Q01E R245Q02 R091Q06 R227Q02T R055Q01 R061Q03 R228Q02 R110Q04 R077Q03
3 R219Q02 R225Q02 R091Q07B R227Q03 R055Q02 R061Q04 R228Q04 R110Q05 R077Q04
4 R081Q01 R225Q03 R119Q09T R227Q04 R055Q03 R061Q05 R099Q04B R110Q06 R077Q05
5 R081Q05 R225Q04 R119Q01 R227Q06 R055Q05 R083Q01 R120Q01 R237Q01 R077Q06
6 R081Q06A R238Q01 R119Q07 R086Q05 R122Q02 R083Q02 R120Q03 R237Q03 R216Q01
7 R081Q06B R238Q02 R119Q06 R086Q07 R122Q03T R083Q03 R120Q06 R236Q01 R216Q02
8 R101Q01 R234Q01 R119Q08 R086Q04 R067Q01 R083Q04 R120Q07T R236Q02 R216Q03T
9 R101Q02 R234Q02 R119Q04 R102Q01 R067Q04 R083Q06 R220Q01 R246Q01 R216Q04
10 R101Q03 R241Q02 R119Q05 R102Q04A R067Q05 R100Q04 R220Q02B R246Q02 R216Q06
11 R101Q04 R102Q05 R076Q03 R100Q05 R220Q04 R239Q01 R088Q01
12 R101Q05 R102Q06 R076Q04 R100Q06 R220Q05 R239Q02 R088Q03
13 R101Q08 R102Q07 R076Q05 R100Q07 R220Q06 R088Q04T
14 R070Q02 R111Q01 R104Q01 R088Q05T
15 R070Q03 R111Q02B R104Q02 R088Q07
16 R070Q04 R111Q04 R104Q05 R040Q02
17 R070Q07T R111Q06B R104Q06 R040Q03A
18 R040Q03B
19 R040Q04
20 R040Q06
Table 92 shows the assignment of items to each of the nine reading clusters. Figures 84 to 92 represent the complete set of plots of order effects in the booklets. There is one figure for each cluster. In the legend, the first digit after ‘Pos.’ shows whether the cluster was in the first or second half of the booklet, and the second digit shows the cluster’s position within that booklet half. For example, Pos. 2.1 means that the cluster was the first one in the second half of the booklet indicated in brackets.
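The quantity plotted in these figures can be illustrated with a small sketch: for each booklet position in which a cluster appears, an item's difficulty estimate in that position is compared with the item's average estimate across positions. This is a hypothetical illustration only; the item codes come from Table 92, but the estimate values and the simple averaging scheme are invented for the example and are not the Consortium's actual scaling procedure.

```python
# Hypothetical order-effect computation: per-position item difficulty
# estimates minus each item's mean estimate across positions.
# Estimate values (in logits) are invented for illustration.
estimates = {  # position label -> {item code: difficulty estimate}
    "Pos. 1.1 (book 1)": {"R219Q01T": 0.52, "R219Q01E": -0.31, "R219Q02": 0.10},
    "Pos. 1.2 (book 5)": {"R219Q01T": 0.48, "R219Q01E": -0.25, "R219Q02": 0.02},
    "Pos. 2.1 (book 7)": {"R219Q01T": 0.61, "R219Q01E": -0.40, "R219Q02": 0.06},
}

items = sorted(next(iter(estimates.values())))

# Mean difficulty of each item across the positions it appeared in.
means = {item: sum(e[item] for e in estimates.values()) / len(estimates)
         for item in items}

# One series of differences per position; these are the plotted values.
differences = {pos: {item: params[item] - means[item] for item in items}
               for pos, params in estimates.items()}
```

By construction, the differences for any item sum to zero across positions, so a positive value in one position is balanced by negative values elsewhere.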
Figure 84: Item Parameter Differences for the Items in Cluster One
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 17. Series: Pos. 1.1 (book 1), Pos. 1.2 (book 5), Pos. 2.1 (book 7).]
Figure 85: Item Parameter Differences for the Items in Cluster Two
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 10. Series: Pos. 1.1 (book 2), Pos. 1.2 (book 1), Pos. 2.1 (book 6).]
Figure 86: Item Parameter Differences for the Items in Cluster Three
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 10. Series: Pos. 1.1 (book 3), Pos. 1.2 (book 2), Pos. 2.1 (book 7).]

Figure 87: Item Parameter Differences for the Items in Cluster Four
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 17. Series: Pos. 1.1 (book 4), Pos. 1.2 (book 3), Pos. 2.1 (book 1).]
Figure 88: Item Parameter Differences for the Items in Cluster Five
[Plot of item parameter differences (-1 to 1) for items 1 to 13. Series: Pos. 1.1 (book 5), Pos. 1.2 (book 4), Pos. 2.1 (book 2).]

Figure 89: Item Parameter Differences for the Items in Cluster Six
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 17. Series: Pos. 1.1 (book 6), Pos. 1.2 (book 5), Pos. 2.1 (book 3).]

Figure 90: Item Parameter Differences for the Items in Cluster Seven
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 13. Series: Pos. 1.1 (book 7), Pos. 1.2 (book 6), Pos. 2.1 (book 4).]
Figure 91: Item Parameter Differences for the Items in Cluster Eight
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 12. Series: Pos. 2.1 (book 8), Pos. 2.2 (book 7), Pos. 2.2 (book 9).]

Figure 92: Item Parameter Differences for the Items in Cluster Nine
[Plot of item parameter differences (-0.4 to 0.4) for items 1 to 20. Series: Pos. 2.1 (book 9), Pos. 2.2 (book 8).]
PISA Consortium
Australian Council for Educational Research
Ray Adams (Project Director of the PISA Consortium)
Alla Berezner (data processing, data analysis)
Claus Carstensen (data analysis)
Lynne Darkin (reading test development)
Brian Doig (mathematics test development)
Adrian Harvey-Beavis (quality monitoring, questionnaire development)
Kathryn Hill (reading test development)
John Lindsey (mathematics test development)
Jan Lokan (quality monitoring, field procedures development)
Le Tu Luc (data processing)
Greg Macaskill (data processing)
Joy McQueen (reading test development and reporting)
Gary Marks (questionnaire development)
Juliette Mendelovits (reading test development and reporting)
Christian Monseur (Director of the PISA Consortium for data processing, data analysis, quality monitoring)
Gayl O’Connor (science test development)
Alla Routitsky (data processing)
Wolfram Schulz (data analysis)
Ross Turner (test analysis and reporting co-ordination)
Nikolai Volodin (data processing)
Craig Williams (data processing, data analysis)
Margaret Wu (Deputy Project Director of the PISA Consortium)
Westat
Nancy Caldwell (Director of the PISA Consortium for field operations and quality monitoring)
Ming Chen (sampling and weighting)
Fran Cohen (sampling and weighting)
Susan Fuss (sampling and weighting)
Brice Hart (sampling and weighting)
Sharon Hirabayashi (sampling and weighting)
Sheila Krawchuk (sampling and weighting)
Dward Moore (field operations and quality monitoring)
Phu Nguyen (sampling and weighting)
Monika Peters (field operations and quality monitoring)
Merl Robinson (field operations and quality monitoring)
Keith Rust (Director of the PISA Consortium for sampling and weighting)
Leslie Wallace (sampling and weighting)
Dianne Walsh (field operations and quality monitoring)
Trevor Williams (questionnaire development)
Citogroep
Steven Bakker (science test development)
Bart Bossers (reading test development)
Truus Decker (mathematics test development)
Erna van Hest (reading test development and quality monitoring)
Kees Lagerwaard (mathematics test development)
Gerben van Lent (mathematics test development)
Ico de Roo (science test development)
Maria van Toor (office support and quality monitoring)
Norman Verhelst (technical advice, data analysis)
Educational Testing Service
Irwin Kirsch (reading test development)
Other experts
Marc Demeuse (quality monitoring)
Harry Ganzeboom (questionnaire development)
Aletta Grisay (technical advice, data analysis, translation, questionnaire development)
Donald Hirsch (editorial review)
Erich Ramseier (questionnaire development)
Marie-Andrée Somers (data analysis and reporting)
Peter Sutton (editorial review)
Rich Tobin (questionnaire development and reporting)
J. Douglas Willms (questionnaire development, data analysis and reporting)
Appendix 7: PISA 2000 Expert Group Membership and Consortium Staff

Reading Functional Expert Group
Irwin Kirsch (Chair) (Educational Testing Service, United States)
Marilyn Binkley (National Center for Educational Statistics, United States)
Alan Davies (University of Edinburgh, United Kingdom)
Stan Jones (Statistics Canada, Canada)
John de Jong (Language Testing Services, The Netherlands)
Dominique Lafontaine (Université de Liège Sart Tilman, Belgium)
Pirjo Linnakylä (University of Jyväskylä, Finland)
Martine Rémond (Institut National de Recherche Pédagogique, France)
Ryo Watanabe (National Institute for Educational Research, Japan)
Prof. Wolfgang Schneider (University of Würzburg, Germany)
Mathematics Functional Expert Group

Jan de Lange (Chair) (Utrecht University, The Netherlands)
Raimondo Bolletta (Istituto Nazionale di Valutazione, Italy)
Sean Close (St Patrick’s College, Ireland)
Maria Luisa Moreno (IES “Lope de Vega”, Spain)
Mogens Niss (IMFUFA, Roskilde University, Denmark)
Kyungmee Park (Hongik University, Korea)
Thomas A. Romberg (United States)
Peter Schüller (Federal Ministry of Education and Cultural Affairs, Austria)
Science Functional Expert Group

Wynne Harlen (Chair) (University of Bristol, United Kingdom)
Peter Fensham (Monash University, Australia)
Raul Gagliardi (University of Geneva, Switzerland)
Svein Lie (University of Oslo, Norway)
Manfred Prenzel (Universität Kiel, Germany)
Senta Raizen (National Center for Improving Science Education (NCISE), United States)
Donghee Shin (Dankook University, Korea)
Elizabeth Stage (University of California, United States)
Cultural Review Panel

Ronald Hambleton (Chair) (University of Massachusetts, United States)
David Bartram (SHL Group plc, United Kingdom)
Sunhee Chae (KICE, Korea)
Chiara Croce (Ministero della Pubblica Istruzione, Direzione Generale degli Scambi Culturali, Italy)
John de Jong (Language Testing Services, The Netherlands)
Jose Muniz (University of Oviedo, Spain)
Yves Poortinga (Tilburg University, The Netherlands)
Norbert Tanzer (University of Graz, Austria)
Pierre Vrignaud (INETOP - Institut National pour l’Enseignement Technique et l’Orientation Professionnelle, France)
Ingemar Wedman (University of Stockholm, Sweden)
Technical Advisory Group

Geoff Masters (Chair) (ACER, Australia)
Ray Adams (ACER, Australia)
Pierre Foy (Statistics Canada, Canada)
Aletta Grisay (Belgium)
Larry Hedges (The University of Chicago, United States)
Eugene Johnson (American Institutes for Research, United States)
John de Jong (Language Testing Services, The Netherlands)
Keith Rust (WESTAT, USA)
Norman Verhelst (Cito group, The Netherlands)
J. Douglas Willms (University of New Brunswick, Canada)
Appendix 8: Contrast Coding for PISA-2000 Conditioning Variables
Conditioning Variables Variable Name(s) Variable Coding Contrast Coding

Sex - Q3 ST03Q01 1 <Female> -1 2 <Male> 1 Missing 0
Mother - Q4a ST04Q01 1 Yes 10 2 No 01 Missing 00
Female Guardian - Q4b ST04Q02 1 Yes 10 2 No 01 Missing 00
Father - Q4c ST04Q03 1 Yes 10 2 No 01 Missing 00
Male Guardian - Q4d ST04Q04 1 Yes 10 2 No 01 Missing 00
Brothers - Q4e ST04Q05 1 Yes 10 2 No 01 Missing 00
Sisters - Q4f ST04Q06 1 Yes 10 2 No 01 Missing 00
Grandparents - Q4g ST04Q07 1 Yes 10 2 No 01 Missing 00
Others - Q4h ST04Q08 1 Yes 10 2 No 01 Missing 00
Older - Q5a ST05Q01 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11
Younger - Q5b ST05Q02 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11
Same age - Q5c ST05Q03 1 None 10 2 One 20 3 Two 30 4 Three 40 5 Four or more 50 Missing 11
Ap
pen
dic
es
289
Mother currently doing - Q6 ST06Q01 1 Working full-time <for pay> 1000 2 Working part-time <for pay> 0100 3 Not working, but looking... 0010 4 Other (e.g., home duties, retired) 0001 Missing 0000
Father currently doing - Q7 ST07Q01 1 Working full-time <for pay> 1000 2 Working part-time <for pay> 0100 3 Not working, but looking... 0010 4 Other (e.g., home duties, retired) 0001 Missing 0000
Mother’s secondary educ - Q12 ST12Q01 1 No, she did not go to school 10000 2 No, she completed <ISCED level 1> only 01000 3 No, she completed <ISCED level 2> only 00100 4 No, she completed <ISCED level 3B, 3C> only 00010 5 Yes, she completed <ISCED 3A> 00001 Missing 00000
Father’s secondary educ - Q13 ST13Q01 1 No, he did not go to school 10000 2 No, he completed <ISCED 1> only 01000 3 No, he completed <ISCED 2> only 00100 4 No, he completed <ISCED 3B, 3C> only 00010 5 Yes, he completed <ISCED 3A> 00001 Missing 00000
Mother’s tertiary educ - Q14 ST14Q01 1 Yes 10 2 No 01 Missing 00
Father’s tertiary educ - Q15 ST15Q01 1 Yes 10 2 No 01 Missing 00
Country of birth, self - Q16a ST16Q01 1 <Country of Test> 10 2 Another Country 01 Missing 00
Country of birth, Mother - Q16b ST16Q02 1 <Country of Test> 10 2 Another Country 01 Missing 00
Country of birth, Father - Q16c ST16Q03 1 <Country of Test> 10 2 Another Country 01 Missing 00
Language at home - Q17 ST17Q01 1 <Test language> 1000 2 <Other official national languages> 0100 3 <Other national dialects or languages> 0010 4 <Other languages> 0001 Missing 0000
Movies - Q18a ST18Q01 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 41
Art gallery - Q18b ST18Q02 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11
Pop Music - Q18c ST18Q03 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11
Opera - Q18d ST18Q04 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11
Theatre - Q18e ST18Q05 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 11
Sport - Q18f ST18Q06 1 Never or hardly ever 10 2 Once or twice a year 20 3 About 3 or 4 times a year 30 4 More than 4 times a year 40 Missing 41
Discuss politics - Q19a ST19Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
PIS
A 2
00
0 t
ech
nic
al r
epor
t
290
Discuss books - Q19b ST19Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
Listen classics - Q19c ST19Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Discuss school issues - Q19d ST19Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
Eat <main meal> - Q19e ST19Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
Just talking - Q19f ST19Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
Mother - Q20a ST20Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Father - Q20b ST20Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Siblings - Q20c ST20Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Grandparents - Q20d ST20Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Other relations - Q20e ST20Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Parents’ friends - Q20f ST20Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Dishwasher - Q21a ST21Q01 1 Yes 10 2 No 01 Missing 00
Own room - Q21b ST21Q02 1 Yes 10 2 No 01 Missing 00
Educat software - Q21c ST21Q03 1 Yes 10 2 No 01 Missing 00
Internet - Q21d ST21Q04 1 Yes 10 2 No 01 Missing 00
Dictionary - Q21e ST21Q05 1 Yes 10 2 No 01 Missing 00
Study place - Q21f ST21Q06 1 Yes 10 2 No 01 Missing 00
Desk - Q21g ST21Q07 1 Yes 10 2 No 01 Missing 00
Text books - Q21h ST21Q08 1 Yes 10 2 No 01 Missing 00
Classic literature - Q21i ST21Q09 1 Yes 10 2 No 01 Missing 00
Poetry - Q21j ST21Q10 1 Yes 10 2 No 01 Missing 00
Art works - Q21k ST21Q11 1 Yes 10 2 No 01 Missing 00
Phone - Q22a ST22Q01 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21
Television - Q22b ST22Q02 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 41
Calculator - Q22c ST22Q03 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 41
Computer - Q22d ST22Q04 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21
Musical instruments - Q22e ST22Q05 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 11
Car - Q22f ST22Q06 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21
Bathroom - Q22g ST22Q07 1 None 10 2 One 20 3 Two 30 4 Three or more 40 Missing 21
<Extension> - Q23a ST23Q01 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Remedial> in <test lang> - Q23b ST23Q02 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Remedial> in other subjects - Q23c ST23Q03 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
Skills training - Q23d ST23Q04 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
In <test language> - Q24a ST24Q01 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
In other subjects - Q24b ST24Q02 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Extension> - Q24c ST24Q03 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Remedial> in <test language> - Q24d ST24Q04 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Remedial> in other subjects - Q24e ST24Q05 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
Skills training - Q24f ST24Q06 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
<Private tutoring> - Q24g ST24Q07 1 No, never 10 2 Yes, sometimes 20 3 Yes, regularly 30 Missing 11
School program - Q25 ST25Q01 1 <ISCED 2A> 100000 2 <ISCED 2B> 010000 3 <ISCED 2C> 001000 4 <ISCED 3A> 000100 5 <ISCED 3B> 000010 6 <ISCED 3C> 000001 Missing 000000
Teachers wait long time - Q26a ST26Q01 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Teachers want students to work - Q26b ST26Q02 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 41
Teachers tell students do better - Q26c ST26Q03 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Teachers don’t like - Q26d ST26Q04 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Teachers show interest - Q26e ST26Q05 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 41
Teachers give opportunity - Q26f ST26Q06 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 41
Teachers help with work - Q26g ST26Q07 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 41
Teachers continue teaching - Q26h ST26Q08 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 31
Teachers do a lot to help - Q26i ST26Q09 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 31
Teachers help with learning - Q26j ST26Q10 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 31
Teachers check homework - Q26k ST26Q11 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Students cannot work well - Q26l ST26Q12 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Students don’t listen - Q26m ST26Q13 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Students don’t start - Q26n ST26Q14 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Students learn a lot - Q26o ST26Q15 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 31
Noise & disorder - Q26p ST26Q16 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Doing nothing - Q26q ST26Q17 1 Never 10 2 Some lessons 20 3 Most lessons 30 4 Every lesson 40 Missing 21
Number in <test language> - Q27a ST27Q01 1 010 2 020 … 90 900 Missing 041
Usual in <test lang> - Q27aa ST27Q02 1 Yes 10 2 No 01 Missing 00
Number in Mathematics - Q27b ST27Q03 1 010 2 020 … 90 900 Missing 041
Usual in Mathematics - Q27bb ST27Q04 1 Yes 10 2 No 01 Missing 00
Number in Science - Q27c ST27Q05 1 010 2 020 … 90 900 Missing 041
Usual in Science - Q27cc ST27Q06 1 Yes 10 2 No 01 Missing 00
In <test language> - Q28a ST28Q01 1 010 2 020 … 90 900 Missing 251
In Mathematics - Q28b ST28Q02 1 010 2 020 … 90 900 Missing 241
In Science - Q28c ST28Q03 1 010 2 020 … 90 900 Missing 241
Miss school - Q29a ST29Q01 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11
<Skip> classes - Q29b ST29Q02 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11
Late for school - Q29c ST29Q03 1 None 10 2 1 or 2 20 3 3 or 4 30 4 5 or more 40 Missing 11
Well with teachers - Q30a ST30Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Interested in students - Q30b ST30Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Listen to me - Q30c ST30Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Give extra help - Q30d ST30Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Treat me fairly - Q30e ST30Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Feel an outsider - Q31a ST31Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11
Make friends - Q31b ST31Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Feel I belong - Q31c ST31Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Feel awkward - Q31d ST31Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11
Seem to like me - Q31e ST31Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Feel lonely - Q31f ST31Q06 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 11
Don’t want to be - Q31g ST31Q07 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Feel Bored - Q31h ST31Q08 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
I complete on time - Q32a ST32Q01 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 31
I do watching TV - Q32b ST32Q02 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
Teachers grade - Q32c ST32Q03 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
I finish at school - Q32d ST32Q04 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
Teachers comment on - Q32e ST32Q05 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
Is interesting - Q32f ST32Q06 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
Is counted in <mark> - Q32g ST32Q07 1 Never 10 2 Sometimes 20 3 Most of the time 30 4 Always 40 Missing 21
Homework <test language> - Q33a ST33Q01 1 No time 10 2 Less than 1 hour a week 20 3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31
Homework <maths> - Q33b ST33Q02 1 No time 10 2 Less than 1 hour a week 20 3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31
Homework <science> - Q33c ST33Q03 1 No time 10 2 Less than 1 hour a week 20 3 Between 1 and 3 hours a week 30 4 3 hours or more a week 40 Missing 31
Read each day - Q34 ST34Q01 1 I do not read for enjoyment 10 2 30 minutes or less each day 20 3 More than 30 minutes to less than 60 minutes each day 30 4 1 to 2 hours each day 40 5 More than 2 hours each day 50 Missing 21
Only if I have to - Q35a ST35Q01 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Favourite hobby - Q35b ST35Q02 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Talking about books - Q35c ST35Q03 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Hard to finish - Q35d ST35Q04 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Feel happy - Q35e ST35Q05 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
Waste of time - Q35f ST35Q06 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Enjoy library - Q35g ST35Q07 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 31
For information - Q35h ST35Q08 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Few minutes only - Q35i ST35Q09 1 Strongly disagree 10 2 Disagree 20 3 Agree 30 4 Strongly agree 40 Missing 21
Magazines - Q36a ST36Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
Comics - Q36b ST36Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Fiction - Q36c ST36Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 21
Non-fiction - Q36d ST36Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
E-mail & Web - Q36e ST36Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 31
Newspapers - Q36f ST36Q06 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
How many books at home - Q37 ST37Q01 1 None 10 2 1-10 books 20 3 11-50 books 30 4 51-100 books 40 5 101-250 books 50 6 251-500 books 60 9 More than 500 books 70 Missing 51
Borrow books - Q38 ST38Q01 1 Never or hardly ever 10 2 A few times per year 20 3 About once a month 30 4 Several times a month 40 Missing 11
How often use school library - Q39a ST39Q01 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
How often use computers - Q39b ST39Q02 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 21
How often use calculators - Q39c ST39Q03 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 51
How often use Internet - Q39d ST39Q04 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
How often use science labs - Q39e ST39Q05 1 Never or hardly ever 10 2 A few times a year 20 3 About once a month 30 4 Several times a month 40 5 Several times a week 50 Missing 11
Calculator Use Caluse 1 ‘No calculator’ 10 2 ‘A simple calculator’ 20 3 ‘A scientific calculator’ 30 4 ‘A programmable calculator’ 40 5 ‘A graphics calculator’ 50 Missing 11
Booklet BookID 0 000000000 1 100000000 2 010000000 3 001000000 4 000100000 5 000010000 6 000001000 7 000000100 8 000000010 9 000000001
Mother’s main job BMMJ 1 010 2 020 … 90 900 Missing 500
Father’s main job BFMJ 1 010 2 020 … 90 900 Missing 501

Multi-column entries without over-bars indicate multiple contrasts. Barred columns are treated as one contrast.
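Two coding patterns recur throughout the table above: a single signed contrast (e.g. Sex - Q3: <Female> = -1, <Male> = 1, Missing = 0) and one-of-k dummy columns where one category maps to an all-zero row (e.g. Booklet ID). The sketch below illustrates both patterns; the function names are invented for this illustration and are not part of any PISA software.

```python
# Hypothetical helpers mirroring two contrast codings from the table above.

def sex_contrast(value):
    # ST03Q01: 1 = <Female> -> [-1], 2 = <Male> -> [1], missing -> [0]
    return {1: [-1], 2: [1]}.get(value, [0])

def booklet_contrast(book_id):
    # BookID: nine dummy columns; booklet k (1-9) sets column k to 1,
    # and booklet 0 is coded as all zeros (the reference category).
    row = [0] * 9
    if 1 <= book_id <= 9:
        row[book_id - 1] = 1
    return row
```

In a conditioning model, these columns would be concatenated across all variables to form each student's regressor vector.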
Appendix 9: Sampling Forms
PISA Sampling Form 1 Time of Testing and Age Definition
See Section 3.1 of School Sampling Manual.
PISA Participant:
National Project Manager:
1. Beginning and ending dates of assessment 2000 to 2000
2. Please confirm that the assessment start date is after the first three months of the academic year.
Yes
No
3. Students who will be assessed were born between and D D M M Y Y D D M M Y Y
4. As part of the PISA sampling process for your country, will you be selecting students other than those born between the dates in 3) above? (For example, students in a particular grade, no matter what age.)
No
Yes (please describe the additional population):
PISA Sampling Form 2 National Desired Target Population
See Section 3.2 of School Sampling Manual.
PISA Participant:
National Project Manager:
1. Total national population of 15-year-olds: [a]
2. Total national population of 15-year-olds enrolled in educational institutions: [b]
3. Describe the population(s) to be omitted from the national desired target population (if applicable).
4. Total enrolment omitted from the national desired target population [c](corresponding to the omissions listed in the previous item):
5. Total enrolment in the national desired target population: [d]box [b] - box [c]
6. Percentage of coverage in the national desired target population: [e](box [d] / box [b]) x 100
7. Describe your data source (Provide copies of relevant tables):
PISA Sampling Form 3 National Defined Target Population
See Section 3.3 of School Sampling Manual.
PISA Participant:
National Project Manager:
1. Total enrolment in the national desired target population: [a] (from box [d] on Sampling Form 2)
2. School-level exclusions:
Description of exclusions | Number of schools | Number of students
TOTAL [b]
Percentage of students not covered due to school-level exclusions: % = (box [b] / box [a]) x 100
3. Total enrolment in national defined target population before within-school exclusions: [c] = box [a] - box [b]
4. Within-school exclusions:
Description of exclusions | Expected number of students
TOTAL [d]
Expected percentage of students not covered due to within-school exclusions: % = (box [d] / box [a]) x 100
5. Total enrolment in national defined target population: [e] = box [a] – (box [b] + box [d])
6. Coverage of national desired target population: [f] = (box [e] / box [a]) x 100
7. Describe your data source (Provide copies of relevant tables):
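The box arithmetic on this form can be sketched in a few lines (a minimal illustration; the function name and the example figures are ours, not part of the form):

```python
def form3_boxes(a, b, d):
    """Box arithmetic of Sampling Form 3 (names a, b, c, d, e, f mirror
    the bracketed boxes on the form)."""
    c = a - b          # item 3: enrolment before within-school exclusions
    e = a - (b + d)    # item 5: national defined target population
    f = 100.0 * e / a  # item 6: coverage of the desired target population
    return c, e, f

# e.g. 60 000 enrolled, 900 excluded at school level, 600 within schools:
# form3_boxes(60000, 900, 600) -> (59100, 58500, 97.5)
```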
PISA Sampling Form 4 Sampling Frame Description
See Sections 5.2 – 5.4 of School Sampling Manual.
PISA Participant:
National Project Manager:
1. Will a sampling frame of geographic areas be used?
Yes Go to 2
No Go to 5
2. Specify the PSU Measure of Size to be used.
15-year-old student enrolment
Total student enrolment
Number of schools
Population size
Other (please describe):
3. Specify the school year for which enrolment data will be used for the PSU Measure of Size:
4. Please provide a preliminary description of the information available to construct the area frame. Please consult with Westat for support and advice in the construction and use of an area-level sampling frame.
5. Specify the school estimate of enrolment (ENR) of 15-year-olds that will be used.
15-year-old student enrolment
Applying known proportions of 15-year-olds to corresponding grade level enrolments
Grade enrolment of the modal grade for 15-year-olds
Total student enrolment, divided by number of grades
6. Specify the year for which enrolment data will be used for school ENR.
7. Please describe any other type of frame, if any, that will be used.
PISA Sampling Form 5 Excluded Schools
See Section 5.5 of School Sampling Manual.
PISA Participant:
National Project Manager:
Use additional sheets if necessary
School ID Reason for exclusion School ENR
PISA Sampling Form 6 Treatment of Small Schools
See Section 5.7 of School Sampling Manual.
PISA Participant:
National Project Manager:
1. Enrolment in small schools:
Type of school based on enrolment | Number of schools | Number of students | Percentage of total enrolment
Enrolment of 15-year-old students < 17 [a]
Enrolment of 15-year-old students ≥ 17 and < 35 [b]
Enrolment of 15-year-old students ≥ 35 [c]
TOTAL 100%
2. If the sum of the percentages in boxes [a] and [b] is less than 5%, AND the percentage in box [a] is less than 1%, then all schools should be subject to normal school sampling, with no explicit stratification of small schools required or recommended.
(box [a] + box [b]) < 5% and box [a] < 1%?  Yes or No
⇓ Go to 6.
3. If the sum of the percentages in boxes [a] and [b] is less than 5%, BUT the percentage in box [a] is 1% or more, a stratum for very small schools is needed. Please see Section 5.7.2 to determine an appropriate school sample allocation for this stratum of very small schools.
(box [a] + box [b]) < 5% and box [a] ≥ 1%?  Yes or No
⇓ Form an explicit stratum of very small schools and record this on Sampling Form 7.
4. If the sum of the percentages in boxes [a] and [b] is 5% or more, BUT the percentage in box [a] is less than 1%, an explicit stratum of small schools is required, but no special stratum for very small schools is required. Please see Section 5.7.2 to determine an appropriate school sample allocation for this stratum of small schools.
(box [a] + box [b]) ≥ 5% and box [a] < 1%?  Yes or No
⇓ Form an explicit stratum of small schools and record this on Sampling Form 7. Go to 6.
5. If the sum of the percentages in boxes [a] and [b] is 5% or more, AND the percentage in box [a] is 1% or more, an explicit stratum of small schools is required, AND an explicit stratum for very small schools is required. Please see Section 5.7.2 to determine an appropriate school sample allocation for these strata of moderately small and very small schools.
(box [a] + box [b]) ≥ 5% and box [a] ≥ 1%?  Yes or No
⇓ Form an explicit stratum of moderately small schools and an explicit stratum of very small schools, and record this on Sampling Form 7.
6. If the percentage in box [a] is less than 0.5%, then these very small schools can be excluded from the national defined target population only if the total extent of school-level exclusions of the type mentioned in 3.3.2 remains below 0.5%. If these schools are excluded, be sure to record this exclusion on Sampling Form 3, item 2.
box [a] < 0.5%?  Yes or No
⇓ Excluding very small schools?  Yes or No
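The decision rules in items 2 to 5 form a simple four-way branch on the two percentages. A minimal sketch (the function name and the returned stratum labels are ours):

```python
def small_school_strata(pct_very_small, pct_moderately_small):
    """Items 2-5 of Sampling Form 6: which explicit strata for small schools
    are needed, given box [a] (15-year-old ENR < 17) and box [b]
    (17 <= ENR < 35) as percentages of total enrolment."""
    small_total = pct_very_small + pct_moderately_small
    if small_total < 5 and pct_very_small < 1:
        return []                              # item 2: no special strata
    if small_total < 5:
        return ["very small"]                  # item 3
    if pct_very_small < 1:
        return ["small"]                       # item 4
    return ["moderately small", "very small"]  # item 5
```

For example, with [a] = 0.4% and [b] = 2% no stratification of small schools is needed, while [a] = 2% and [b] = 6% requires both strata.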
PISA Sampling Form 7 Stratification
See Section 5.6 of School Sampling Manual.
PISA Participant:
National Project Manager:
Explicit Stratification
1. List and describe the variables used for explicit stratification.
Explicit stratification variables Number of levels
1
2
3
4
5
2. Total number of explicit strata:
(Note: if the number of explicit strata exceeds 99, the PISA school coding scheme will not work correctly. See Section 6.6. Consult Westat and ACER.)
Implicit Stratification
3. List and describe the variables used for implicit stratification in the order in which they will be used (i.e., sorting of schools within explicit strata).
Implicit stratification variables Number of levels
1
2
3
4
5
PISA Sampling Form 8 Population Counts by Strata
See Section 5.9 of School Sampling Manual.
PISA Participant:
National Project Manager:
Use additional sheets if necessary
(1) Explicit Strata | (2) Implicit Strata | Population Counts: (3) Schools | (4) Students ENR | (5) Students MOS
PISA Sampling Form 9 Sample Allocation by Explicit Strata
See Section 6.2 of School Sampling Manual.
PISA Participant:
National Project Manager:
Use additional sheets if necessary
(1) Explicit Strata | (2) Stratum Identification | (3) % of eligible students in population | (4) % of eligible schools in population | Sample Allocation: (5) Schools | (6) Students | (7) No. of students expected in sample
PISA Sampling Form 10 School Sample Selection
See Section 6.3 of School Sampling Manual.
PISA Participant:
National Project Manager:
Explicit Stratum: Stratum ID
[a] Total Measure of Size    [b] Desired Sample Size    [c] Sampling Interval    [d] Random Number
Use additional sheets if necessary
(1) Line Numbers    (2) Selection Numbers
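These boxes drive systematic PPS selection: the sampling interval [c] is [a] / [b], and the column (2) selection numbers are generated from the random number [d]. A minimal sketch of one common variant of the rule (truncate and add 1; the function name is ours, and the exact rounding rule is the one specified in Section 6.3 of the School Sampling Manual):

```python
def selection_numbers(total_mos, sample_size, random_number):
    """Sampling Form 10 boxes: [a] total measure of size, [b] desired
    sample size, [c] = [a]/[b] sampling interval, [d] random number in
    [0, 1). Returns the column (2) selection numbers; a school is selected
    when its cumulative MOS (Form 11, column 4) first reaches one."""
    interval = total_mos / sample_size  # box [c]
    return [int(random_number * interval + i * interval) + 1
            for i in range(sample_size)]

# e.g. selection_numbers(10000, 4, 0.5) -> [1251, 3751, 6251, 8751]
```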
PISA Sampling Form 11 School Sampling Frame
See Sections 6.4 - 6.6 of School Sampling Manual.
PISA Participant:
National Project Manager:
Explicit Stratum: Stratum ID
Use additional sheets if necessary
(1) School List Identification | (2) Implicit Stratum | (3) MOS | (4) Cumulative MOS | (5) Sample Status | (6) PISA School Identification
Appendix 10: Student Listing Form
Page __ of __
School Identification: Country Name:
School Name: List Prepared By:
Address: Telephone Number:
Date List Prepared:
Total Number Students Listed:
DIRECTIONS: PLEASE COMPLETE COLUMNS A, B, C AND D, FOR EVERY STUDENT <BORN IN 1984>.
Include students who may be excluded from other testing programs, such as some students with disabilities or limited language proficiency. Detailed instructions and information about providing computer-generated lists are on the other side of this page.
For Sampling Only: Selected Student (Enter “S”) | Line #
(A) Student’s Name (First, Middle Initial, Last) | (B) Grade | (C) Sex (M/F) | (D) Birth Date (mm/yy)
A. Instructions for Preparing a List of Eligible Students
1. Please prepare a list of ALL students <born in 1984. . .NPM must insert eligibility criteria> using the most current enrolment records available.
2. Include on the list students who typically may be excluded from other testing programs (such as some students withdisabilities or limited language proficiency).
3. Write the name for each eligible student. Please also specify current grade, sex, and birth date for each student.
4. If confidentiality is a concern in listing student names, then a unique student identifier may be substituted. Because some students may have the same or similar names, it is important to include a birth date for each student.
5. The list may be computer-generated or prepared manually using the PISA Student Listing Form. A Student Listing Form is on the reverse side of these instructions. You may copy this form or request copies from your National Project Manager.
6. If you use the Student Listing Form on the reverse side of this page, do not write in the “For Sampling Only”columns.
7. Send the list to the National Project Manager (NPM) to arrive no later than <insert DATE>. Please address to theNPM as follows: <insert name and mailing address>
B. Suggestions for Preparing Computer-Generated Lists
1. Write the school name and address on list.
2. List students in alphabetical order.
3. Number the students.
4. Double-space the list.
5. Allow a left-hand margin of at least two inches.
6. Include the date the printout was prepared.
7. Define any special codes used.
8. Include preparer’s name and telephone number.
Appendix 11: Student Tracking Form

Page __ of __
Country Name: ______________________  Stratum Number: ______________
School Name: ______________________  School ID: ______________

SAMPLING INFORMATION
(A) Number of Students Aged 15: ________
(B) Number of Students Listed for Sampling: ________
(C) Sample Size: ________
(D) Random Number: 0.________
(E) Sampling Interval: ________
(F) First Line Number Selected [(Box D X Box E) + 1]: ________

(1) Line Number | (2) Student ID (Sample) | (3) Student Name | (4) Grade | (5) Gender (M/F) | (6) Birth Date (MM-YY) | (7) Excluded Code | (8) Booklet Number | Participation Status: (9) Original Session (P1, P2, SQ) | (10) Follow-up Session (P1, P2, SQ)

Rows 1 to 15.

EXCLUSION CODES (Col. 7):
1 = Functionally disabled
2 = Educable mentally retarded
3 = Limited test language proficiency
4 = Other

PARTICIPATION CODES (Cols. 9 & 10):
0 = Absent
1 = Present
8 = Not applicable, i.e., excluded, no longer in school, not age eligible

(continued)

PISA STUDENT TRACKING FORM (continued)

Page __ of __
Country Name: ______________________  Stratum Number: ______________
School Name: ______________________  School ID: ______________

Columns (1) to (10) as on the first page.

Rows 16 to 35.
Appendix 12: Adjustment to BRR for Strata with Odd Numbers of Primary Sampling Units
Splitting an odd-sized sample into unequal half-samples without adjustment results in positively biased estimates of variance with BRR. The bias may not be serious if the number of units is large and if the sample has a single-stage design. If, however, the sample has a multi-stage design and the number of second-stage units per first-stage unit is even moderately large, the bias may be severe. The following adjustment, developed by David Judkins of Westat, removes the bias.
The units in each stratum are divided into two half-samples. Without loss of generality, assume that the first half-sample is the smaller of the two if the number of sample units in a stratum is odd. For PISA, $n_h = 2$ or 3. Let $u_h$ be the least integer greater than or equal to $n_h/2$ ($u_h = 1$ or 2 for PISA), $l_h$ be the greatest integer less than or equal to $n_h/2$ ($l_h = 1$ for PISA), $0 \le k < 1$ be the Fay factor ($k = 0.5$ for PISA), and $d_{ht} = \pm 1$ according to a rule such that $\sum_t d_{ht} d_{h't} = 0$ for $h \ne h'$ (as is obtained from a Hadamard matrix of appropriate size).

Then the replication factors for the first and second half-samples are:

$$1 + d_{ht}(1 - k)\sqrt{\frac{u_h}{l_h}} \quad \text{and} \quad 1 - d_{ht}(1 - k)\sqrt{\frac{l_h}{u_h}}. \tag{89}$$

Hence for PISA, if $n_h = 2$, the replication factors are $1 + \tfrac{1}{2}d_{ht}$ and $1 - \tfrac{1}{2}d_{ht}$, that is, 1.5 and 0.5, or 0.5 and 1.5, for the first and second half-samples respectively. If $n_h = 3$, the replication factors are:

$$1 + \frac{\sqrt{2}}{2}d_{ht} \quad \text{and} \quad 1 - \frac{\sqrt{2}}{4}d_{ht}, \tag{90}$$

that is, 1.7071 and 0.6464, or 0.2929 and 1.3536, where the first factor in each case is the one that applies to the single PSU in the first half-sample, while the second factor applies to the two PSUs in the second half-sample.

Proof

Let $x_{hi}$ be an unbiased estimate of the characteristic of interest for the $i$-th unit in the $h$-th stratum. Let $x = \sum_h \sum_i x_{hi}$ be the full sample estimate. Let $\delta_{hi1}$ and $\delta_{hi2}$ denote membership of half-samples 1 and 2 respectively. Let $L$ be the number of strata and $T$ be the number of replicates ($T = 80$ for PISA). Let $\alpha_h = u_h / l_h$.

Then

$$v = \frac{1}{T(1-k)^2} \sum_{t=1}^{T} \left[ \sum_{h=1}^{L} \sum_{i=1}^{n_h} \left( \left(1 + d_{ht}(1-k)\sqrt{\alpha_h}\right)\delta_{hi1} + \left(1 - d_{ht}(1-k)\frac{1}{\sqrt{\alpha_h}}\right)\delta_{hi2} \right) x_{hi} - \sum_{h=1}^{L} \sum_{i=1}^{n_h} x_{hi} \right]^2 = \frac{1}{T} \sum_{t=1}^{T} \left[ \sum_{h=1}^{L} d_{ht} \left( \sqrt{\alpha_h} \sum_{i=1}^{n_h} \delta_{hi1} x_{hi} - \frac{1}{\sqrt{\alpha_h}} \sum_{i=1}^{n_h} \delta_{hi2} x_{hi} \right) \right]^2. \tag{91}$$
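Equation (89) is easy to tabulate for the PISA cases. A minimal sketch (the function name is ours) that reproduces the factors quoted in the text for $n_h = 2$ and $n_h = 3$:

```python
import math

def replication_factors(n_h, d_ht, k=0.5):
    """Fay replication factors of equation (89) for a stratum with n_h
    PSUs. The first half-sample holds l_h = floor(n_h/2) PSUs, the second
    u_h = ceil(n_h/2); k is the Fay factor (0.5 for PISA)."""
    u_h = -(-n_h // 2)  # ceiling of n_h / 2
    l_h = n_h // 2      # floor of n_h / 2
    first = 1 + d_ht * (1 - k) * math.sqrt(u_h / l_h)
    second = 1 - d_ht * (1 - k) * math.sqrt(l_h / u_h)
    return first, second
```

With d_ht = +1 this gives 1.5 and 0.5 for n_h = 2, and 1.7071 and 0.6464 for n_h = 3; with d_ht = -1 it gives 0.2929 and 1.3536 for n_h = 3, matching the factors quoted above.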
Given the balancing property of $d_{ht}$, this simplifies to:

$$v = \sum_{h=1}^{L} \left( \sqrt{\alpha_h} \sum_{i=1}^{n_h} \delta_{hi1} x_{hi} - \frac{1}{\sqrt{\alpha_h}} \sum_{i=1}^{n_h} \delta_{hi2} x_{hi} \right)^2. \tag{92}$$

For convenience, rewrite the simplified $v$ as:

$$\hat{v} = \sum_{h=1}^{L} \left( \hat{x}_{h1} - \hat{x}_{h2} \right)^2, \quad \text{where} \quad \hat{x}_{h1} = \sqrt{\alpha_h} \sum_{i=1}^{n_h} \delta_{hi1} x_{hi} \quad \text{and} \quad \hat{x}_{h2} = \frac{1}{\sqrt{\alpha_h}} \sum_{i=1}^{n_h} \delta_{hi2} x_{hi}. \tag{93}$$

To prove unbiasedness, note that $E(\hat{x}_{h1}) = \sqrt{\alpha_h}\, l_h X_h = \sqrt{u_h l_h}\, X_h$, where $X_h = E(x_{hi})$ (assuming identical distribution within stratum), and that:

$$E(\hat{x}_{h2}) = \frac{1}{\sqrt{\alpha_h}}\, u_h X_h = \sqrt{u_h l_h}\, X_h = E(\hat{x}_{h1}). \tag{94}$$

Note furthermore that under simple random sampling with replacement,

$$\mathrm{Var}(\hat{x}_{h1}) = \alpha_h l_h \sigma_h^2 = u_h \sigma_h^2 \quad \text{and} \quad \mathrm{Var}(\hat{x}_{h2}) = \frac{1}{\alpha_h}\, u_h \sigma_h^2 = l_h \sigma_h^2, \quad \text{where} \quad \sigma_h^2 = \mathrm{Var}(x_{hi}). \tag{95}$$

Thus,

$$E(\hat{v}) = \sum_{h=1}^{L} E\left(\hat{x}_{h1} - \hat{x}_{h2}\right)^2 = \sum_{h=1}^{L} \left[ \mathrm{Var}(\hat{x}_{h1}) + \mathrm{Var}(\hat{x}_{h2}) \right] = \sum_{h=1}^{L} (u_h + l_h)\,\sigma_h^2 = \sum_{h=1}^{L} n_h \sigma_h^2, \tag{96}$$

which is the true sampling variance of $x$.
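The simplified estimator (93) is a one-liner per stratum. A minimal sketch (the function name and the data layout, with each stratum's unit estimates listed smaller half-sample first, are ours):

```python
import math

def brr_adjusted_variance(strata):
    """Equation (93): v-hat = sum over strata of (x_h1 - x_h2)^2, where the
    first (smaller) half-sample occupies the leading l_h positions of each
    stratum's list of unit estimates and alpha_h = u_h / l_h."""
    v_hat = 0.0
    for units in strata:
        l_h = len(units) // 2        # size of first half-sample
        u_h = len(units) - l_h       # size of second half-sample
        alpha_h = u_h / l_h
        x_h1 = math.sqrt(alpha_h) * sum(units[:l_h])
        x_h2 = sum(units[l_h:]) / math.sqrt(alpha_h)
        v_hat += (x_h1 - x_h2) ** 2
    return v_hat
```

For a single stratum with unit estimates 1 and 3 this gives (1 - 3)^2 = 4; for unit estimates 1, 2, 3 it gives (sqrt(2) - 5/sqrt(2))^2 = 4.5.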