

www.elsevier.com/stueduc

Studies in Educational Evaluation 32 (2006) 317–344

A META-ANALYSIS OF GENDER DIFFERENCES IN READING ACHIEVEMENT AT THE SECONDARY SCHOOL LEVEL

Petra Lietz

International University Bremen, Germany1

Abstract

This study conducts a meta-analysis using hierarchical linear modeling (HLM) to address the questions of the extent of gender differences in reading, whether or not these differences apply similarly across English and non-English speaking countries, and whether they decrease with age. Female secondary students performed 0.19 standard deviation units above their male peers, regardless of age, language of instruction, and whether effect sizes were based on mean differences or correlation coefficients. However, differences emerged between assessment programs, with recent Australian studies, the National Assessment of Educational Progress (NAEP) and the Programme for International Student Assessment (PISA) reporting greater gender differences.

Introduction

The research reported in this article extends across two major areas: one content-related, namely gender differences in reading, and one method-related, namely meta-analysis. Each of these areas is discussed below.

Gender Differences in Reading Achievement

The view prevails that boys perform better than girls in mathematics and the natural sciences whereas the reverse holds in languages, social studies and reading. A closer

0191-491X/04/$ – see front matter © 2006 Published by Elsevier Ltd.

doi:10.1016/j.stueduc.2006.10.002



examination of the research on reading, however, reveals that this is not as clear-cut as it appears and that results can be grouped into two main categories: one showing evidence of girls' superiority over boys in reading achievement, and one providing no evidence of gender differences.

In a large-scale study of students in Grades 2, 4, and 6 in which sampling was confined to small geographical areas in four English-speaking countries, namely Canada, England, Nigeria and the U.S., Johnson (1973-1974) found evidence to support the view put forward by Preston (1962) that gender differences in reading were culturally rather than physiologically determined because gender differences varied between countries: In England and Nigeria, boys' mean performance exceeded that of girls whereas the reverse applied in Canada and the United States.

In 1990/91, the Reading Literacy Study was undertaken by the International Association for the Evaluation of Educational Achievement (IEA). Out of a total of 31 education systems that participated at the 14-year-old level, Wagemaker, Taube, Munck, Kontogiannopoulou-Polydorides and Martin (1996) reported significant gender differences on the total reading score for 12 systems, with 10 favouring girls and 2 favouring boys.

In 1995, a reading study of Grade 6 students was conducted by the Southern Africa Consortium for Monitoring Educational Quality using the instruments designed by the IEA Reading Literacy Study. Saito (1998) calculated differences in mean percentages of correct answers for boys and girls on the total reading score. While girls performed better than boys in Mauritius and Zimbabwe, boys achieved higher scores in Namibia, Zambia and Zanzibar. These differences, however, were not significant at the 95% confidence level as none of the differences was greater than or equal to two sampling errors. Likewise, in an application of the IEA Reading Literacy Study in Latvia, Dedze (1995) reported higher mean achievement of female students (485) than male students (476). Again, however, these differences turned out not to be significant when the standard errors (6.2 for boys and 6.4 for girls) were taken into account.

In Western Australia, the program "Monitoring Standards in Education" included assessments of reading of students in Years 7 and 10 in 1992, 1995, 1997, and 2001 (Department of Education and Training, 2003). Results revealed a decline in average reading performance over that 10-year period at the Year 10 level whereas average performance remained stable at the Year 7 level. The gender differences in reading achievement were similar at Year 7 and Year 10 levels, except in the 1997 assessment where the gender difference for Year 7 students was far smaller than the one for Year 10 students. Girls outperformed boys across all Year levels and on all occasions.

The reading assessment component of the 2003 National Assessment of Educational Progress (NAEP) conducted in the United States was the sixth assessment in a series that had started in 1992. The assessment of the 8th grade students comprised three main types of exercises, namely reading for literary experience, reading to gain information, and reading to perform a task. Results of the 2003 assessment were averaged across the three types of exercises and revealed that scores for female students were 11 points higher on average than scores for male 8th graders. This difference was the same as it had been in previous NAEP assessment cycles (Plisko, 2003).

There is, however, some supportive evidence for the claim that boys and girls perform equally well in reading.



The first major international comparative study into reading, the Reading Comprehension Study, was conducted by IEA as part of the Six Subject Survey in 1970/71. Thorndike (1973), who reported the findings of the study, found median correlations between selected background variables and reading comprehension scores across all 15 participating education systems, namely Belgium (Flemish), Belgium (French), Chile, England, Finland, Hungary, India, Iran, Israel, Italy, the Netherlands, New Zealand, Scotland, Sweden, and the United States of America. While the coefficients at the 14-year-old level were positive, indicating a higher performance of girls, the gender differences were not significant in spite of the large and nationally representative samples employed. Indeed, the median correlations for sex of students were the smallest when compared to other student background factors, and far behind reading resources and father's occupation.

Hogrebe, Nist and Newman (1985) undertook a review of research on gender differences in reading at both the primary and secondary school levels. According to their review, no evidence for significant gender differences in reading had emerged from studies conducted in the 1930s and 1940s. Their analysis of the 1980 data set of the High School and Beyond database showed that gender accounted for less than one per cent of the variance in reading achievement test scores. This was the case not only in the overall sample but also when gender differences were analyzed for different subgroups for which such differences could be assumed to emerge more readily, e.g., low achieving students.

Likewise, in a systematic meta-analysis on the topic of gender differences in reading achievement, Knickerbocker (1989) concluded that, in general, studies have not provided any evidence for the superiority of girls over boys.

In conclusion, it can be argued that while the research discussed above provides more support for the existence of a gender gap in reading performance in favour of female students, some studies dispute this finding. Moreover, studies provide inconclusive evidence as regards the extent of gender differences in reading at the secondary school level. Hence, a more systematic approach to integrating research findings, namely statistical meta-analysis (Glass, McGaw, & Smith, 1981; Hunter & Schmidt, 2004), is suggested and discussed in greater detail below.

Meta-Analysis

The term "meta-analysis" was first introduced by Glass (1976, 1977) to denote a systematic integration of research findings on a specific topic, and has been developed further as an analytical technique (Rosenthal, 1984). The need for a more systematic way of integrating prior research than by means of narrative research reviews was identified as a reaction to criticisms aimed at the social sciences by both funding agencies and the public as to whether or not any progress was being made in terms of deriving firm knowledge from the abundant and seemingly contradictory evidence generated by research in the social sciences (Light & Smith, 1971).

As Hunter and Schmidt (2004, p. 16) emphasize: "In many areas of research, the need today is not for additional empirical data but for some means of making sense of the vast amounts of data that have been accumulated." Moreover, they point out that the narrative integration of research findings has serious shortcomings in that this strategy of integrating research results often leads to different conclusions if done by different people.



Statistical meta-analysis, in contrast, as a quantitative way of integrating research findings, should lead to the same conclusion regardless of the person applying the procedure.

Thus, the challenge in the social sciences in general, and in educational research in particular, is to systematically and quantitatively integrate findings from the large number of research studies that have been undertaken in order to contribute empirically verified facts to the cumulative body of knowledge.

Various researchers have taken up the challenge and undertaken stringent statistical meta-analyses to summarize research efforts in education on a number of different topics. Kulik, Bangert, and Williams (1983), for example, investigated the effects of computer-based teaching on secondary school students. In their meta-analysis of results of 51 independent research studies on this issue, they found that students who experienced computer-based instruction performed about 0.32 standard deviations better than students taught by traditional methods. The authors also investigated twelve study characteristics – ranging from whether the computer was used for managing, tutoring, simulation, programming or drill to whether the instructional materials had been produced locally or commercially – and their potential impact on effect size. Only year of publication and duration of the study were found to have a significant impact on effect size, suggesting that more recent and shorter studies had a more positive effect on performance than older and longer studies.

Bangert-Drowns, Hurley and Wilkinson (2004) reported results of a meta-analysis in which they summarized 48 independent studies evaluating school-based writing-to-learn programs. The effect sizes that the authors used were calculated on the basis of standardized mean differences, whereby the mean academic performance of the conventional learning group was subtracted from that of the writing-to-learn group and divided by the standard deviation pooled across the two groups. In cases where a primary study had reported only t-tests or F-values, the authors used transformations suggested by Glass et al. (1981) to arrive at effect sizes. Bangert-Drowns et al. (2004) argued that assigning greater weight to larger studies by weighting each effect size by the inverse of its sampling error, as suggested by Hedges (1994), might not be warranted. In order to shed light on this issue, they performed the analyses with both weighted and unweighted means of Cohen's d (Cohen, 1988). Results showed an average unweighted Cohen's d of 0.26 standard deviations, with a 95% confidence interval from 0.15 to 0.38, compared with an average weighted Cohen's d of 0.17 with a 95% confidence interval from 0.11 to 0.22. As the unweighted as well as the weighted average effect size differed significantly from zero, the authors proceeded to examine whether or not a number of variables such as treatment context, treatment intensity or features of the writing task had an impact on effect size. Analyses with both weighted and unweighted effect sizes led to the same conclusion, namely that minutes per in-class writing task, metacognitive stimulation and students in the middle grades – when compared with elementary, high school and college students – were significantly related to effect size.
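
The effect-size computations described above can be sketched in a few lines. The following Python functions are illustrative only, not code from any of the cited studies: a standardized mean difference, a Glass et al. (1981) style conversion from an independent-samples t-value, an approximate sampling variance for d, and an inverse-variance weighted mean in the spirit of Hedges (1994). All function names are this sketch's own.

```python
import math

def cohens_d(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """Standardized mean difference: (group 1 mean - group 2 mean) / pooled SD."""
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2)
                          / (n_1 + n_2 - 2))
    return (mean_1 - mean_2) / pooled_sd

def d_from_t(t, n_1, n_2):
    """Convert an independent-samples t-value to d (standard transformation)."""
    return t * math.sqrt((n_1 + n_2) / (n_1 * n_2))

def sampling_variance(d, n_1, n_2):
    """Approximate sampling variance of d under simple random sampling."""
    return (n_1 + n_2) / (n_1 * n_2) + d**2 / (2 * (n_1 + n_2))

def weighted_mean_d(ds, vs):
    """Inverse-variance weighted mean of effect sizes ds with variances vs."""
    weights = [1.0 / v for v in vs]
    return sum(w * d for w, d in zip(weights, ds)) / sum(weights)
```

For two groups of 100 students with means 52 and 50 and a common standard deviation of 10, `cohens_d` returns 0.2; weighting two effect sizes that have equal sampling variances reduces to their simple average, which is why weighted and unweighted means diverge only when study precisions differ.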

Cohen, Kulik and Kulik (1982) undertook a meta-analysis of 65 independent studies that investigated the impact of tutoring on student performance. Like Kulik, Bangert and Williams (1983), the authors ensured that each primary study was only included once in the meta-analysis. To this end, where multiple reports described findings from the same study, the most complete report was used, and where multiple measures of outcomes or measures



for different school subjects were reported, these were combined into one composite measure. In addition to the effect sizes, the authors examined 13 study features, including whether or not the tutors had received training, whether or not students were assigned randomly to study groups, whether performance was assessed by way of a commercial or a teacher-designed test, and the class level of tutors. Six study features were shown to have an impact on the effect size between tutoring and performance of the tutored student, whereby effect sizes from studies were higher when: (a) performance was measured by an instructor-developed test; (b) lower-order tasks were assessed; (c) tutoring was structured; (d) duration of tutoring was shorter (i.e., up to 18 weeks); (e) primary study results were unpublished; and (f) mathematics rather than reading was the subject matter.

Other meta-analyses have focused on a number of different aspects of reading achievement. Thus, Neuman (1986) undertook a meta-analysis of eight state-wide assessments and data from the 1984 NAEP study in order to examine the link between reading achievement, TV viewing and related reading activities. Results of the meta-analysis showed that TV viewing contributed little to the explanation of differences in reading and that TV viewing did not replace time spent on leisure-time reading. In the area of primary reading instruction, Stahl and Miller (1988) conducted a meta-analysis of 34 studies to compare the effectiveness of basal reader approaches and the language experience approach. No significant differences in effect sizes emerged depending on the type of approach employed, which suggested similar effectiveness of both approaches.

However, none of these meta-analyses has focused specifically on gender differences in reading. In addition, advances in hierarchical linear modeling (HLM) have occurred which allow for the clustered nature of meta-analytic data to be taken into account more appropriately. Thus, Raudenbush and Bryk (2002) have argued that the main purpose of a meta-analysis is to examine the extent to which effects reported in the results of primary studies are consistent and to disentangle what part of the variance in study results is due to sampling error and which component is due to actual treatment implementation. As a consequence, Raudenbush and Bryk (2002) have proposed an empirical Bayes meta-analysis as a special application of the two-level hierarchical linear model. In this model, the outcome variable, namely the effect sizes from the different studies, was allowed to vary randomly at the first level while, at the second level, study characteristics were used to explain possible differences in the outcome variable. In other words, the Level-1 analyses were aimed at investigating the extent of the variance in effect sizes of primary studies while at Level 2 possible sources of this variation might be examined. This extension to two levels was based on the use of ordinary regression models in research synthesis proposed originally by Hedges and Olkin (1983).
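
The variance partition at the heart of this two-level logic can be illustrated with a simpler moment-based estimator. The sketch below uses the DerSimonian-Laird estimator as a stand-in for the empirical Bayes HLM procedure (it is not the method used in this paper): the Q statistic measures weighted variation of the effect sizes around their pooled mean, and any excess over its expectation of k - 1 under homogeneity is attributed to true between-study variance.

```python
def between_study_variance(ds, vs):
    """Split observed variation in effect sizes into sampling error and true
    between-study variance (DerSimonian-Laird moment estimator).

    ds: effect sizes d from the primary studies; vs: their sampling variances v.
    Returns (pooled mean, tau-squared)."""
    k = len(ds)
    w = [1.0 / v for v in vs]                       # inverse-variance weights
    d_bar = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    # Q: weighted squared deviations; expected value is k - 1 if all variation
    # is sampling error
    q = sum(wi * (di - d_bar) ** 2 for wi, di in zip(w, ds))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)  # scaling constant
    tau2 = max(0.0, (q - (k - 1)) / c)              # true variance component
    return d_bar, tau2
```

With three identical effect sizes, tau-squared is zero: all observed variation is consistent with sampling error. Level-2 study characteristics are only worth examining when tau-squared is clearly positive.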

In summary, meta-analysis is a systematic way to synthesize findings of research studies on a certain topic. After a systematic search and retrieval of relevant studies, their results are scaled to a common unit of measurement, expressed as effect sizes, usually d (Cohen, 1988), and corrected for different sources of error, in particular sampling error. The assumption of the meta-analytic approach is that these disattenuated effect sizes are all estimates of a common effect that underlies a whole population of studies. Where variation in effect sizes emerges that is not due to sampling error, the analysis seeks to explain those differences in terms of variation arising from the different contexts and characteristics of the primary studies. As a result of this process, meta-analysis allows the:



• estimation of effect size parameters,
• explanation of differences in estimates of effect size,
• examination of stronger estimates of effect sizes in particular situations, and
• modelling of factors producing effects in different contexts and under different conditions.

Method

It has been argued (e.g., Cook et al., 1992) that meta-analyses frequently suffer from a lack of transparency as regards the inclusion or exclusion of primary studies. In order to increase transparency, a summary of the principles guiding the selection of primary studies whose results entered the current meta-analysis is given in Table 1.

Authors differ in their views on which primary studies to include in a meta-analysis. Slavin (1984, 1986), for example, argued that only primary studies of sound methodological quality should be included in a meta-analysis. Glass et al. (1981), on the other hand, claimed that the breadth of the available evidence should be used when synthesizing the current state of knowledge in a particular research area. This view was also supported by Kulik and Kulik (1989), who argued that meta-analyses with a high-quality approach to selecting primary studies were often left with too few studies to allow the statistical analysis of the results.

It should be noted that, over and above the criteria given in Table 1, no further evaluation of studies was undertaken to determine the inclusion or exclusion of studies entering the current meta-analysis.

In Table 2 an overview of the studies included in this meta-analysis is provided, whereby national studies or authors analyzing data from national studies are listed first, followed by international assessment programs. After the sequential study number in Column 1, the name of the study or the name of the author who reported the study is listed in the second column, followed by information about the country in which the study was conducted in the third column.

The data that are used in the meta-analysis are provided in Columns 4 to 14. The first of these columns contains the effect size in the form of Cohen's d. Effect size is defined by Cohen as follows:

…it is convenient to use the phrase "effect size" to mean "the degree to which the phenomenon is present in the population", or "the degree to which the null hypothesis is false." Whatever the manner of representation of a phenomenon in a particular research in the present treatment, the null hypothesis always means that the effect size is zero. (1988, p. 8)



Table 1: What the Meta-Analysis Is (Not) About

Qualifier: English, verbal ability
  Not about: Studies that used Grade in English or only general verbal ability as a measure
  About: Studies had to include some measure of reading comprehension or reading achievement in the language of instruction

Qualifier: Academic achievement
  Not about: Studies that did not separate out different aspects of academic achievement, for example combining mathematics and reading in a single outcome measure
  About: Studies had to include some measure of reading comprehension or reading achievement in the language of instruction

Qualifier: Language
  Not about: Reading as part of foreign language learning
  About: Focus was on mother-tongue reading or reading in the language of instruction

Qualifier: Information provided
  Not about: Policy papers, discussion/opinion papers, narrative reviews
  About: Studies had to provide some data amenable to meta-analysis (means, correlations, regression/path coefficients)

Qualifier: Level of schooling
  Not about: Primary school level
  About: Secondary school level (i.e., Grade 6 or 12-year-old students to Grade 12 or 18-year-old students)

Qualifier: Type of variable
  Not about: Studies that used reading as a predictor, mediator or moderator
  About: Studies that used reading achievement or reading comprehension as the outcome variable or which focused on correlating various factors or variables with reading achievement

Qualifier: Type of results reported
  Not about: Studies reporting percentage differences in achievement
  About: Studies reporting mean differences or correlations

Qualifier: Reading dimension
  Not about: Comprehension of a specific type of text or using reading for a specific purpose (e.g., RL's "documents", "expository" and "narrative" domains or PISA's "retrieving", "interpreting", "reflecting" and "evaluation" skills)
  About: An overall score of performance in reading

Qualifier: Type of student
  Not about: Samples that focused on students with disabilities or ethnic minority students
  About: Samples that were representative of mainstream secondary school students

Qualifier: Level of data collection
  Not about: Teacher ratings of student achievement; analyses reported at school level (e.g., headmaster studies)
  About: Studies had to focus on student-level variables, with information provided by students

Qualifier: Type of information
  Not about: Results not separated for primary and secondary school students
  About: Information on effect sizes (e.g., correlation coefficients or mean differences) had to be reported for secondary school students

Qualifier: Type of publication
  Not about: Dissertations
  About: Journal articles (as retrieved from a search using "secondary" and "student factors" and "reading achievement" or "reading performance" in ERIC, Web of Science and PsycINFO and selected according to the criteria in this table) or published study reports

Qualifier: Date of study
  Not about: Prior to 1970 or after 2002
  About: 1970-2002



Table 2: Data Used in the Meta-Analysis, in Alphabetical Order of Author/Study

No.  Study/Author  Country  ES(d)  v  English  Mean Age  RC RL PISA NAEP AUS  ES Mean

1  ASSP 1975  Australia  0.090  0.001  1  14.00  0 0 0 0 1  1
2  ASSP 1980  Australia  0.110  0.001  1  14.00  0 0 0 0 1  1
3  YIT 1989  Australia  0.080  0.001  1  14.00  0 0 0 0 1  1
4  LSAY 1995  Australia  0.190  0.000  1  14.00  0 0 0 0 1  1
5  LSAY 1998  Australia  0.230  0.000  1  14.00  0 0 0 0 1  1
6  WA monitoring 1992  Australia  0.313  0.003  1  12.00  0 0 0 0 1  1
7  WA monitoring 1992  Australia  0.344  0.003  1  15.00  0 0 0 0 1  1
8  WA monitoring 1995  Australia  0.344  0.003  1  12.00  0 0 0 0 1  1
9  WA monitoring 1995  Australia  0.389  0.003  1  15.00  0 0 0 0 1  1
10  WA monitoring 1997  Australia  0.193  0.003  1  12.00  0 0 0 0 1  1
11  WA monitoring 1997  Australia  0.448  0.003  1  15.00  0 0 0 0 1  1
12  WA monitoring 1999  Australia  0.406  0.003  1  12.00  0 0 0 0 1  1
13  WA monitoring 1999  Australia  0.434  0.003  1  15.00  0 0 0 0 1  1
14  WA monitoring 2001  Australia  0.306  0.000  1  12.00  0 0 0 0 1  1
15  WA monitoring 2001  Australia  0.496  0.004  1  15.00  0 0 0 0 1  1
16  WA monitoring 2002  Australia  0.230  0.000  1  12.00  0 0 0 0 1  1
17  Fuller et al. 1994  Botswana  0.143  0.000  1  15.00  0 0 0 0 0  1
18  Fuller et al. 1994  Botswana  0.192  0.000  1  16.00  0 0 0 0 0  1
19  Glossop et al. 1979  England  -0.155  0.006  1  15.00  0 0 0 0 0  0
20  Gorman et al. 1982  Engl., Wales, Nth. Ireland  0.013  0.001  1  15.75  0 0 0 0 0  1
21  Gorman et al. 1982  Nth. England  0.014  0.004  1  15.00  0 0 0 0 0  1
22  Gorman et al. 1982  Midlands  -0.040  0.005  1  15.00  0 0 0 0 0  1
23  Gorman et al. 1982  Sth. England  0.025  0.003  1  15.00  0 0 0 0 0  1
24  Gorman et al. 1982  Wales  -0.023  0.005  1  15.00  0 0 0 0 0  1
25  Gorman et al. 1982  Nth. Ireland  0.136  0.004  1  15.00  0 0 0 0 0  1
26  Youngman 1980  UK  0.040  0.003  1  12.00  0 0 0 0 0  0
27  Youngman 1980  UK  0.283  0.003  1  12.00  0 0 0 0 0  0
28  Hogrebe et al. 1985  USA, HSB-80  -0.050  0.000  1  17.00  0 0 0 0 0  1
29  Hogrebe et al. 1985  USA, HSB-80  -0.090  0.000  1  15.00  0 0 0 0 0  1
30  NAEP 2003  USA  0.220  0.005  1  13.00  0 0 0 1 0  1
31  NAEP 2002  USA  0.180  0.005  1  13.00  0 0 0 1 0  1
32  NAEP 1998  USA  0.280  0.005  1  13.00  0 0 0 1 0  1
33  NAEP 1994  USA  0.300  0.005  1  13.00  0 0 0 1 0  1
34  NAEP 1992  USA  0.260  0.005  1  13.00  0 0 0 1 0  1
35  NAEP 2002  USA  0.320  0.005  1  17.00  0 0 0 1 0  1
36  NAEP 1998  USA  0.320  0.005  1  17.00  0 0 0 1 0  1
37  NAEP 1994  USA  0.280  0.005  1  17.00  0 0 0 1 0  1
38  NAEP 1992  USA  0.200  0.005  1  17.00  0 0 0 1 0  1
39  Neuman&Prowda 1982  Connecticut 1978-79, USA  0.120  0.000  1  13.00  0 0 0 0 0  0
40  Neuman&Prowda 1982  Connecticut 1978-79, USA  0.100  0.000  1  16.00  0 0 0 0 0  0
41  Hedges&Nowell 1995  USA, NELS-88  0.090  0.005  1  13.00  0 0 0 0 0  1
42  Hedges&Nowell 1995  USA, NLS-72  0.050  0.005  1  17.00  0 0 0 0 0  1
43  Hedges&Nowell 1995  USA, NLSY-80  0.180  0.005  1  18.50  0 0 0 0 0  1
44  Oakland&Stern 1989  Texas, USA  0.006  0.003  0  10.50  0 0 0 0 0  0
45  Project Talent 1960  USA  0.150  0.005  1  15.00  0 0 0 0 0  1
46  Shilling&Lynch 1985  Pennsylvania, USA  0.161  0.005  1  13.00  0 0 0 0 0  0
47  Johnson 1973-74  Canada  0.172  0.041  1  12.00  0 0 0 0 0  1
48  Johnson 1973-74  England  -0.250  0.039  1  12.00  0 0 0 0 0  1
49  Johnson 1973-74  Nigeria  -0.870  0.038  1  13.00  0 0 0 0 0  1
50  Johnson 1973-74  USA  0.103  0.041  1  12.00  0 0 0 0 0  1
51  PISA 2000  Australia  0.330  0.001  1  15.00  0 0 1 0 0  1
52  PISA 2000  Austria  0.250  0.001  0  15.00  0 0 1 0 0  1
53  PISA 2000  Belgium  0.330  0.001  0  15.00  0 0 1 0 0  1
54  PISA 2000  Canada  0.320  0.000  1  15.00  0 0 1 0 0  1
55  PISA 2000  Czech Republic  0.370  0.001  0  15.00  0 0 1 0 0  1
56  PISA 2000  Denmark  0.250  0.001  0  15.00  0 0 1 0 0  1
57  PISA 2000  Finland  0.510  0.001  0  15.00  0 0 1 0 0  1
58  PISA 2000  France  0.290  0.001  0  15.00  0 0 1 0 0  1
59  PISA 2000  Germany  0.340  0.000  0  15.00  0 0 1 0 0  1
60  PISA 2000  Greece  0.370  0.003  0  15.00  0 0 1 0 0  1
61  PISA 2000  Hungary  0.310  0.002  0  15.00  0 0 1 0 0  1
62  PISA 2000  Iceland  0.400  0.000  0  15.00  0 0 1 0 0  1
63  PISA 2000  Ireland  0.290  0.001  1  15.00  0 0 1 0 0  1
64  PISA 2000  Italy  0.380  0.001  0  15.00  0 0 1 0 0  1
65  PISA 2000  Japan  0.300  0.004  0  15.00  0 0 1 0 0  1
66  PISA 2000  Korea  0.140  0.001  0  15.00  0 0 1 0 0  1
67  PISA 2000  Luxembourg  0.270  0.000  0  15.00  0 0 1 0 0  1
68  PISA 2000  Mexico  0.210  0.001  0  15.00  0 0 1 0 0  1
69  PISA 2000  New Zealand  0.460  0.001  1  15.00  0 0 1 0 0  1
70  PISA 2000  Norway  0.430  0.001  0  15.00  0 0 1 0 0  1
71  PISA 2000  Poland  0.360  0.002  0  15.00  0 0 1 0 0  1
72  PISA 2000  Portugal  0.240  0.002  0  15.00  0 0 1 0 0  1
73  PISA 2000  Spain  0.240  0.001  0  15.00  0 0 1 0 0  1
74  PISA 2000  Sweden  0.370  0.001  0  15.00  0 0 1 0 0  1
75  PISA 2000  Switzerland  0.300  0.002  0  15.00  0 0 1 0 0  1
76  PISA 2000  UK  0.250  0.001  1  15.00  0 0 1 0 0  1
77  PISA 2000  US  0.280  0.005  1  15.00  0 0 1 0 0  1
78  PISA 2000  Albania  0.590  0.005  0  15.00  0 0 1 0 0  1
79  PISA 2000  Argentina  0.440  0.005  0  15.00  0 0 1 0 0  1
80  PISA 2000  Brazil  0.160  0.001  0  15.00  0 0 1 0 0  1
81  PISA 2000  Bulgaria  0.480  0.005  0  15.00  0 0 1 0 0  1
82  PISA 2000  Chile  0.250  0.005  0  15.00  0 0 1 0 0  1
83  PISA 2000  Hong Kong  0.150  0.005  1  15.00  0 0 1 0 0  1
84  PISA 2000  Indonesia  0.200  0.005  1  15.00  0 0 1 0 0  1
85  PISA 2000  Israel  0.150  0.005  1  15.00  0 0 1 0 0  1
86  PISA 2000  Latvia  0.530  0.005  0  15.00  0 0 1 0 0  1
87  PISA 2000  Liechtenstein  0.320  0.002  0  15.00  0 0 1 0 0  1
88  PISA 2000  Macedonia  0.510  0.005  0  15.00  0 0 1 0 0  1
89  PISA 2000  Peru  0.060  0.005  0  15.00  0 0 1 0 0  1
90  PISA 2000  Romania  0.130  0.005  0  15.00  0 0 1 0 0  1
91  PISA 2000  Russia  0.380  0.002  0  15.00  0 0 1 0 0  1
92  PISA 2000  Thailand  0.420  0.005  1  15.00  0 0 1 0 0  1
93  PISA 2000  Netherlands  0.300  0.001  0  15.00  0 0 1 0 0  1
94  RC 1970-71  Belgium (Fl.)  0.100  0.036  0  14.00  1 0 0 0 0  0
95  RC 1970-71  Belgium (Fr.)  0.345  0.056  0  14.00  1 0 0 0 0  0
96  RC 1970-71  Chile  -0.242  0.010  0  14.00  1 0 0 0 0  0
97  RC 1970-71  England  0.201  0.007  1  14.00  1 0 0 0 0  0
98  RC 1970-71  Finland  0.000  0.014  0  14.00  1 0 0 0 0  0
99  RC 1970-71  Hungary  0.040  0.005  0  14.00  1 0 0 0 0  0
100  RC 1970-71  India  0.040  0.007  1  14.00  1 0 0 0 0  0
101  RC 1970-71  Iran  -0.060  0.033  0  14.00  1 0 0 0 0  0
102  RC 1970-71  Israel  -0.060  0.008  1  14.00  1 0 0 0 0  0
103  RC 1970-71  Italy  0.040  0.003  0  14.00  1 0 0 0 0  0
104  RC 1970-71  Netherlands  -0.060  0.021  0  14.00  1 0 0 0 0  0
105  RC 1970-71  New Zealand  0.040  0.014  1  14.00  1 0 0 0 0  0
106  RC 1970-71  Scotland  -0.140  0.015  1  14.00  1 0 0 0 0  0
107  RC 1970-71  Sweden  0.120  0.011  0  14.00  1 0 0 0 0  0
108  RC 1970-71  USA  0.080  0.007  1  14.00  1 0 0 0 0  0
109  RL 1990-91  Trin.&Tobago  0.299  0.011  1  14.40  0 1 0 0 0  1
110  RL 1990-91  Thailand  0.304  0.007  1  15.20  0 1 0 0 0  1
111  RL 1990-91  Ireland  0.284  0.007  1  14.50  0 1 0 0 0  1
112  RL 1990-91  Canada (BC)  0.259  0.005  1  13.90  0 1 0 0 0  1
113  RL 1990-91  Sweden  0.188  0.007  0  14.80  0 1 0 0 0  1
114  RL 1990-91  Finland  0.215  0.015  0  14.70  0 1 0 0 0  1
115  RL 1990-91  Hungary  0.192  0.007  0  14.10  0 1 0 0 0  1
116  RL 1990-91  United States  0.153  0.006  1  15.00  0 1 0 0 0  1
117  RL 1990-91  Iceland  0.167  0.007  0  14.80  0 1 0 0 0  1
118  RL 1990-91  Italy  0.123  0.006  0  14.10  0 1 0 0 0  1
119  RL 1990-91  Netherlands  0.118  0.006  0  14.30  0 1 0 0 0  1
120  RL 1990-91  Cyprus  0.110  0.020  0  14.80  0 1 0 0 0  1
121  RL 1990-91  Germany (E)  0.096  0.010  0  14.40  0 1 0 0 0  1
122  RL 1990-91  Belgium (Fr.)  0.077  0.007  0  14.30  0 1 0 0 0  1
123  RL 1990-91  Botswana  0.140  0.007  1  14.70  0 1 0 0 0  1
124  RL 1990-91  Hong Kong  0.078  0.006  1  15.20  0 1 0 0 0  1
125  RL 1990-91  New Zealand  0.054  0.008  1  15.00  0 1 0 0 0  1
126  RL 1990-91  Philippines  0.077  0.004  1  14.50  0 1 0 0 0  1
127  RL 1990-91  Slovenia  0.079  0.007  0  14.70  0 1 0 0 0  1
128  RL 1990-91  Denmark  0.052  0.005  0  14.80  0 1 0 0 0  1
129  RL 1990-91  Germany (W)  0.051  0.005  0  14.60  0 1 0 0 0  1
130  RL 1990-91  Norway  0.056  0.007  0  14.80  0 1 0 0 0  1
131  RL 1990-91  Spain  0.062  0.003  0  14.20  0 1 0 0 0  1
132  RL 1990-91  Switzerland  0.041  0.003  0  14.90  0 1 0 0 0  1
133  RL 1990-91  Venezuela  0.033  0.006  0  15.50  0 1 0 0 0  1
134  RL 1990-91  Greece  0.015  0.007  0  14.40  0 1 0 0 0  1
135  RL 1990-91  Nigeria  0.000  0.013  1  15.30  0 1 0 0 0  1
136  RL 1990-91  Singapore  0.000  0.007  1  14.40  0 1 0 0 0  1
137  RL 1990-91  France  -0.059  0.008  0  15.40  0 1 0 0 0  1
138  RL 1990-91  Portugal  -0.133  0.008  0  15.60  0 1 0 0 0  1
139  RL 1990-91  Zimbabwe  -0.283  0.007  1  15.50  0 1 0 0 0  1

The reason for Cohen's emphasis on effect sizes stems from his criticism of the widespread use of significance tests. Cohen points out that the reliance on such tests is misleading not only in that a number of assumptions underlying these tests are frequently not met but also in that these tests provide less information than is possible: While a significance test provides information only as to whether or not the null hypothesis is false, the effect size provides additional information regarding the specific degree to which the hypothesis is false.


P. Lietz / Studies in Educational Evaluation 32 (2006) 317–344 327

Thus whether measured in one unit or another, whether expressed as a difference between two population parameters or the departure of a population parameter from a constant or in any other suitable way, the ES can itself be treated as a parameter which takes the value zero when the null hypothesis is true and some other specific nonzero value when the null hypothesis is false, and in this way the ES serves as an index of degree of departure from the null hypothesis. (Cohen, 1988, p. 10)

Cohen's d is interpreted as follows: if d is calculated to be 0.2, the means differ by two tenths of a standard deviation. According to Cohen (1988, p. 21), d is a pure number, freed of dependence upon any specific unit of measurement. A value of 2.0 for d indicates that the means differ by two standard deviations. An examination of the effect sizes in the third column of Table 2 reveals that values range from -0.87 (Study 49), indicating higher achievement of male students, through 0.00 (e.g., Studies 4, 5, 54, 59), indicating no gender differences in reading achievement, to 0.59 (Study 78), indicating higher performance by female students of about six tenths of a standard deviation.
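As an illustration of this interpretation, the following sketch computes d from summary statistics. The means, standard deviations, and sample sizes are invented for demonstration and are not taken from any study in Table 2.

```python
def cohens_d(mean_f, mean_m, sd_f, sd_m, n_f, n_m):
    """Standardized mean difference using the pooled within-group SD."""
    pooled_sd = (((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2)
                 / (n_f + n_m - 2)) ** 0.5
    return (mean_f - mean_m) / pooled_sd

# Hypothetical values: female mean 52, male mean 50, both SDs 10.
# The groups differ by two tenths of a standard deviation, i.e. d = 0.2.
d = cohens_d(52, 50, 10, 10, 100, 100)
print(round(d, 2))  # 0.2
```

A positive d here denotes an advantage for the first (female) group, matching the sign convention used in Table 2.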

The column that follows the effect size is labelled "v", which is the squared standard error of d and reflects the hierarchical structure of the samples of the original studies (Raudenbush & Bryk, 1985; for further details on how "v" was calculated, see Equations 2 and 3 below). In the next column, a "1" is assigned if the reading test was administered in English to the whole or the majority of the sample and a "0" if the test was administered in a language other than English. The inclusion of this variable in the analysis is intended to enable us to investigate the potential impact on the variation in gender differences in reading of whether or not the test is administered in English. This is particularly interesting for those assessment programs in which test design takes place in English while tests are administered in many different languages (i.e., PISA, RC, RL). In Column seven, information regarding the mean age of the sample for each study is collated in order to examine whether or not the possible gender gap in reading increases or decreases with age.

The next five columns indicate to which assessment program, if any, an individual study belongs. Results reported by Levine and Ornstein (1983), for example, drew on data provided by NAEP, which was the reason for assigning a "1" to this study in the column headed "NAEP" and a "0" under the PISA, Reading Comprehension, and Reading Literacy Study headings, as the data were not gathered as part of these assessment programs. These dummy variables are recorded to allow for the examination of possible effects stemming from the overall design of each of these assessment programs.

In the last column, headed "ES Mean", studies are classified according to whether the calculation of the effect size is based on primary study results in the form of means (in which case the study is coded "1") or correlation coefficients (in which case the study is coded "0"). Through the inclusion of this information in the meta-analysis, it is possible to see whether systematic differences occur in effect sizes as a result of the method of calculating effect sizes.

Below follows a short description of the assessment programs from which most of the primary study results in the meta-analysis are taken.


Reading Comprehension Study

The first large-scale cross-national survey of reading was conducted by the International Association for the Evaluation of Educational Achievement (IEA) in fifteen education systems as the Reading Comprehension Study, which formed part of IEA's Six Subject Survey in 1970/71. The reading comprehension test consisted of eight passages and 52 multiple-choice test items that were designed to measure four categories, namely the ability to: (a) follow the organization of a passage; (b) respond to questions that are specifically answered in the passage; (c) draw inferences from a passage; and (d) determine the writer's purpose. Items were administered to a representative sample of 14-year-old students in each of the participating education systems (Thorndike, 1973).

Reading Literacy Study

The Reading Literacy Study was the next study of reading performance conducted by IEA, in 1990/91. This time, 31 education systems participated at the 14-year-old level. As in the first study, samples representative of the target population were drawn in each country under the supervision of an international sampling referee. The design of the reading test had shifted from an emphasis on skills to an emphasis on different types of reading materials. As a consequence, students had to answer a total of 89 multiple-choice items relating to 19 passages (Elley, 1994).

Programme for International Student Assessment (PISA)

In the late twentieth century, the Organization for Economic Co-operation and Development (OECD) launched its Programme for International Student Assessment (PISA) with the main aim of comparing the performance of students towards the end of compulsory schooling in key subject areas, namely Mathematics, Reading and Science, across its member countries. The focus of the first round of data collection in 2000, in which a total of 43 OECD and non-OECD member countries participated, was on reading. The reading test assessed performance on five processes, namely retrieving information, forming a broad general understanding, developing an interpretation, reflecting on and evaluating the content of a text, and reflecting on and evaluating the form of a text. Items were of the multiple-choice as well as the open constructed-response type and related to continuous and non-continuous texts. Each participating country had to survey a nationally representative sample of 15-year-old students and comply with the sampling guidelines of the OECD (Adams & Wu, 2002).

NAEP Studies

The National Assessment of Educational Progress (NAEP) is an assessment program run by the National Center for Education Statistics (NCES) in the United States Department of Education. Since 1969, NAEP has conducted studies in a number of subject areas, including reading, to assess achievement levels of nationally representative student samples in Grades 4, 8, and 12. In the most recent reading test design, students were assessed on four aspects of reading. These covered: (a) forming a general understanding; (b) developing interpretation; (c) relating information in the text to own knowledge and experience; and (d) examining content and structure, which requires critical evaluation and an appreciation of the effects of text features such as irony, humour and organization. To this end, the reading comprehension test employed multiple-choice questions, designed to test students' understanding of individual texts as well as their ability to integrate and synthesize ideas across texts, and constructed-response questions, which required students to construct their own answers (Plisko, 2003).

Australian Studies

In Australia, data on the reading performance of secondary school students were available from a number of studies. They included the 1975 and 1980 studies Australian Studies in School Performance (ASSP) and Australian Studies in Student Performance (1980), the Youth in Transition study (YIT) in 1989, and the Longitudinal Surveys of Australian Youth (LSAY) that were conducted in 1995 and 1998. The ASSP data included national samples at both ages 10 and 14, whereas the Youth in Transition and LSAY surveys collected data from 14-year-olds only (Rothman, 2002).

The reading tests used in these various studies were not the same. The 1975 test was designed to assess minimum competency, and therefore focused on the lower levels of achievement, while the later tests generally covered a wider range of student performance. However, all tests contained a number of common items, which were used in the analysis of trends in reading achievement over time (de Lemos, 1997; Marks & Ainley, 1996).

The "Monitoring Standards in Education" (MSE) program in Western Australia started with the "Random Sample" assessment program in 1990, with further data collections in 1992, 1995, 1997, 1999 and 2001, in which 10% of students in each of Grades 3, 7, and 10 were tested. In 1998, the Western Australian Literacy and Numeracy Assessment (WALNA) population testing began with Grade 3 students. Subsequently, the assessment of Grade 5 was introduced and Grade 7 was also included. Data collection from Grade 10 students has continued to be undertaken through the Random Sample assessment program. Reading performance was assessed on a range of texts that included continuous texts, for example poems, media releases and narrative extracts, as well as non-continuous texts such as charts or tables.

One might think that the focus of the current meta-analysis on gender differences in reading achievement at the secondary school level was sufficiently narrow to allow for a relatively straightforward investigation. Unfortunately, this was not the case. Studies which were retrieved as a result of the literature search differed markedly not only in design, sample size, scope and the scale of the reading score but also in the reporting of results. Thus, results were frequently not reported in terms of standardized effect sizes but in terms of correlation coefficients, regression coefficients from single-level and multi-level analyses, sums of squares, percentage differences or mean differences. Hence, some form of standardization of the results reported by the different studies was required in order to arrive at a metric-free effect size (ES) that could be processed further in the meta-analysis. The formulae which were employed in the conversion of correlation coefficients, standardized scores, and proportions of test items answered correctly to standardized effect sizes are given in Appendix I.
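Appendix I itself is not reproduced in this excerpt. As an illustration only, a commonly used conversion from a correlation coefficient to a standardized mean difference (Cohen, 1988) is d = 2r / sqrt(1 - r^2); whether this is the exact formula employed in Appendix I is an assumption.

```python
import math

def r_to_d(r):
    """Convert a (point-biserial) correlation r to a Cohen's-d-type
    effect size. This is the standard conversion d = 2r / sqrt(1 - r^2);
    it is shown here as an illustration, not as the formula from
    Appendix I, which is not reproduced in this excerpt."""
    return 2 * r / math.sqrt(1 - r**2)

# A small gender-achievement correlation of 0.10 corresponds to an
# effect size of roughly 0.2 standard deviation units.
print(round(r_to_d(0.10), 3))  # 0.201
```

This makes concrete why some standardization step is needed: a correlation of 0.10 and a mean difference of one fifth of a standard deviation describe roughly the same gender gap on different metrics.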

As the next step, a so-called "v-known" hierarchical linear model analysis (Raudenbush, Bryk, Cheong, & Congdon, 2001; Hox, 1995) was undertaken. "V-known" models may be considered a special case of a two-level hierarchical linear model. In general, hierarchical linear models seek to take into consideration the nested structure of many data sets whereby, for example, students (Level-1) are nested within schools (Level-2). In these instances, variation in the outcome variable at Level-1 - frequently a measure of student performance in some subject area - is explained by variables at Level-1 - for example, gender, socio-economic status or homework effort - as well as by variables at Level-2 - for example, school resources, size of school, or location of school. In a meta-analysis, the hierarchical structure of the data is such that the within-study variation is modelled at Level-1 while between-study variation is used at Level-2 to explain variability at Level-1. In other words, multilevel modeling as applied to meta-analysis proceeds in two steps. First, it examines whether the within-study results at Level-1 are homogeneous or heterogeneous. If results are homogeneous, the effect sizes may be combined into one average outcome. If the results are heterogeneous, between-study characteristics such as type of study design or type of study participants are examined at Level-2 to see whether or not they contribute to explaining differences in results. The reason why Raudenbush and Bryk (2002) labelled these multilevel models for meta-analysis "v-known models" stems from the fact that the variability at Level-1 is considered to be sampling variability, which is known if the relevant sampling distribution and sample sizes are known. Below, the "v-known" HLM meta-analysis is worked through for the current meta-analysis based on the considerations put forward by Raudenbush and Bryk (2002, pp. 208-210).

The effect size (ES) estimate, dj, for most of the studies in Table 2 is the standardized mean difference between the average reading scores for female and male students:

dj = (ȲEj - ȲCj) / Sj   [1]

where

ȲEj is the average reading score for the experimental group, that is, female students
ȲCj is the average reading score for the control group, that is, male students
Sj is the pooled, within-group standard deviation

Each of the effect sizes recorded in Table 2 is an estimate of the population mean difference between the experimental group, which in this context consists of female students, and the control group, which, in this instance, is constituted by male students. Thus, in the second study in Table 2, for example, female students score one tenth of a standard deviation higher than male students.

With reference to Hedges (1981), Raudenbush and Bryk (2002) stated that dj follows a normal distribution with variance Vj where

Vj = (nEj + nCj) / (nEj nCj) + δj² / [2(nEj + nCj)]   [2]

and assert that "it is common to substitute dj for δj and then assume that Vj is 'known'" (Raudenbush & Bryk, 2002, p. 209). While the above formula applies to instances where effect sizes are calculated on the basis of mean differences, the following formula applies to effect sizes calculated on the basis of correlation coefficients:

Vj = 1 / (nj - 3)   [3]
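The two sampling-variance formulae can be written out directly as a check of the arithmetic. The sample sizes and effect size in the example are invented for illustration.

```python
def v_from_means(n_e, n_c, d):
    """Squared standard error of d for mean-difference effect sizes
    (Equation 2), substituting the estimate d for delta as described
    in the text."""
    return (n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c))

def v_from_correlation(n):
    """Sampling variance when the effect size is derived from a
    correlation coefficient (Equation 3)."""
    return 1 / (n - 3)

# Hypothetical study: 100 female and 100 male students, d = 0.2.
print(round(v_from_means(100, 100, 0.2), 4))  # 0.0201
# Hypothetical correlation-based study with n = 203.
print(round(v_from_correlation(203), 4))      # 0.005
```

Note that the first term in Equation 2 dominates for small effect sizes, which is why the "v" values in Table 2 track sample size so closely.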

In order to define the hierarchical model for meta-analytic problems, equations have to be formulated at two levels. The model at Level-1, that is the within-study model, is:

dj = δj + ej   [4]

where each of the effect sizes for the 139 studies in the current meta-analysis is considered to be one estimate of the underlying population parameter δj plus the sampling error associated with each estimate, ej, with ej ~ N(0, Vj).

At Level-2, study characteristics and random error are considered to predict the unknown effect size δj. Thus, the model at Level-2, that is the between-study model, is:

δj = γ0 + γ1W1j + γ2W2j + ... + γ8W8j + uj   [5]

where:

W1j, ..., W8j are the study characteristics, namely:

(a) Two general predictor variables:
    W1 = English as the language of test administration
    W2 = age

(b) Assessment program, if any, to which a study belongs:
    W3 = RC
    W4 = RL
    W5 = PISA-2000
    W6 = NAEP
    W7 = Australian studies (Aus)

(c) The basis for calculating the effect size:
    W8 = effect size based on mean differences (as opposed to the other studies in the meta-analysis, whose effect sizes are based on correlation coefficients)

γ0, ..., γ8 are the regression coefficients associated with the study characteristics W1 to W8


uj is the Level-2 random error, where uj ~ N(0, τ)

In order to combine the two levels into a single model, δj in Equation 4 is replaced by the right-hand side of Equation 5:

dj = γ0 + γ1W1j + γ2W2j + ... + γ8W8j + uj + ej   [6]

In summary, the Level-1 outcome variable in the current meta-analysis is the effect size, which quantifies the difference between male and female students' performance in reading reported by each study. If the variation in effect sizes is found not to be due to chance, the analysis reveals the extent to which variables W1 to W8 contribute to explaining the variance.
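The estimation step behind this model can be sketched as precision-weighted least squares. This is a minimal illustration only: it assumes the Level-2 residual variance τ is already known (the HLM software in fact estimates τ iteratively together with the γ coefficients), uses a single dummy predictor rather than the eight in Equation 5, and all of the effect sizes and variances below are invented for demonstration.

```python
def v_known_wls(d, v, w1, tau):
    """Precision-weighted least squares for d_j = g0 + g1*W1j + u_j + e_j,
    where Var(u_j + e_j) = tau + v_j is treated as known.
    Returns the estimated (g0, g1)."""
    wts = [1.0 / (tau + vj) for vj in v]
    # Accumulate the 2x2 weighted normal equations by hand.
    s_w   = sum(wts)
    s_wx  = sum(w * x for w, x in zip(wts, w1))
    s_wxx = sum(w * x * x for w, x in zip(wts, w1))
    s_wd  = sum(w * dj for w, dj in zip(wts, d))
    s_wxd = sum(w * x * dj for w, x, dj in zip(wts, w1, d))
    det = s_w * s_wxx - s_wx ** 2
    g0 = (s_wxx * s_wd - s_wx * s_wxd) / det
    g1 = (s_w * s_wxd - s_wx * s_wd) / det
    return g0, g1

d  = [0.10, 0.10, 0.30, 0.30]   # invented effect sizes
v  = [0.01, 0.01, 0.01, 0.01]   # their "known" sampling variances
w1 = [0.0, 0.0, 1.0, 1.0]       # one dummy study characteristic
print(v_known_wls(d, v, w1, tau=0.0))  # approximately (0.1, 0.2)
```

Studies with smaller sampling variance (typically larger samples) receive larger weights, which is exactly how the "v-known" model lets precise studies dominate the estimated γ coefficients.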

W1 and W2 are specified to examine the potential effects of two variables, namely whether or not English is the language of testing and the average age of the students in a particular study. This allows the examination of two questions.

(1) As most of the instrument construction for international tests is undertaken in English, and with an eye on gender-equitable materials in that language, gender differences may be less pronounced in countries where English is the language of instruction and test administration.

(2) As male students mature later than female students and reading is basically a process of reasoning (Lietz, 1996; Thorndike, 1917), gender differences may decrease with increasing age.

The effect sizes used in this meta-analysis are taken from large-scale national and international studies. Thus, in order to examine the possible systematic impact of the assessment program designs on effect sizes for each group of studies, dummy variables W3 to W7 are created to indicate to which group of studies an individual study result belongs. In this way, it is possible to investigate whether or not any gender differences are larger as a consequence of a particular feature in the design of, for example, the NAEP studies when compared with all other studies.

Finally, the variable W8 is generated as an additional dummy variable to indicate whether effect sizes have been calculated on the basis of mean differences (coded 1) or correlation coefficients (coded 0). The inclusion of this dummy variable as a predictor at Level-2 allows the investigation of whether or not the effect sizes vary as a consequence of using either mean differences or correlation coefficients as a basis for calculation. In this way, the study seeks to contribute to Cohen's idea of a "universal ES index" by examining the extent to which effect sizes based on different measurement units contribute to the between-study variance in the meta-analysis.

To this point, the ES has been considered quite abstractly as a parameter which can take on varying values (including zero in the null case). In the previous illustrations, ES was variously expressed as a departure in percent from 50, a departure in IQ units from 100, a product moment correlation coefficient, a difference between two medians in dollars, etc. It is clearly desirable to reduce this diversity of units as far as possible, consistent with present usage by behavioral scientists. From one point of view, a universal ES index, applicable to all the various research issues and statistical models would be the ideal. Apart from some formidable mathematical-statistical problems in the way, even if such an ideal could be achieved, the result would express ES in terms so unfamiliar to the research in behavioral science as to be self-defeating. (Cohen, 1988, p. 11)

In the following section, results of a two-level hierarchical linear analysis with the data from Table 2 are reported.

Results

As mentioned above, the first step in a meta-analysis using HLM is to examine whether the effect sizes from the different primary studies are homogeneous or heterogeneous. Where heterogeneity can be ascertained, the analysis proceeds to investigating the way in which possible study characteristics may contribute to the variability in effect sizes.

It can be seen in Table 3 that the estimated grand-mean effect size, the intercept in the model, is positive and small, γ0 = 0.19, which means that, on average, female secondary students performed about 0.19 standard deviation units above male secondary students. Note that the degrees of freedom are 138, one fewer than the number of studies in the analysis, as one degree of freedom is needed for the estimation of the unconditional model.

Furthermore, the estimated variance of the effect parameter is 0.02427, which corresponds to a standard deviation of 0.156, indicating important variability in the true effect sizes. Moreover, the χ² value (2521.478) and corresponding p-value (0.000) confirm that this variance is not due to chance and that the residual variance is significantly different from zero. As a consequence, the analysis can proceed to examine which of the predictor variables that reflect study characteristics might explain this variance.

Table 3: Final Estimation of Fixed Effects: Unconditional 'v-Known' Model

Final estimation of fixed effects:

Fixed Effect              Coefficient   Standard Error   t-ratio   Approx. df   P-value
For EFFSIZE, B1
  INTRCPT2, G10           0.189559      0.014489         13.083    138          0.000

Final estimation of variance components:

Random Effect    Standard Deviation   Variance Component   df    Chi-square    P-value
EFFSIZE, U1      0.15580              0.02427              138   2521.47793    0.000

Statistics for current covariance components model
Deviance = 933.076669
Number of estimated parameters = 2


In order to examine which of the predictor variables contribute to differences in effect sizes between studies, an HLM model which included all possible predictors as specified in Equations 4, 5, and 6 above was examined; the results are presented in Table 4. Note that the degrees of freedom are now reduced to 130 as, in addition to the intercept, eight potential Level-2 predictors need to be estimated.

Table 4: Final Estimation of Fixed Effects: 'v-Known' Model with all Possible Predictors Included

Final estimation of fixed effects:

Fixed Effect              Coefficient   Standard Error   t-ratio   Approx. df   P-value
For EFFSIZE, B1
  INTRCPT2,  G10          0.061057      0.145809          0.419    130          0.675
  ENG,       G11         -0.005007      0.029412         -0.170    130          0.865
  AGE,       G12          0.002276      0.010484          0.217    130          0.828
  RC,        G13         -0.068625      0.060317         -1.138    130          0.256
  RL,        G14          0.038197      0.040589          0.941    130          0.347
  PISA,      G15          0.258935      0.039054          6.630    130          0.000
  NAEP,      G16          0.213371      0.050156          4.254    130          0.000
  AUS,       G17          0.218313      0.041082          5.314    130          0.000
  ESMEAN,    G18         -0.040833      0.055645         -0.734    130          0.463

Final estimation of variance components:

Random Effect    Standard Deviation   Variance Component   df    Chi-square    P-value
EFFSIZE, U1      0.10092              0.01018              130   1038.79389    0.000

Statistics for current covariance components model
Deviance = 846.682527
Number of estimated parameters = 2

Of the eight possible predictors, three emerge with significant effects whereas five do not contribute to explaining study result variability. Thus, age and whether or not English is the language of test administration do not emerge as significant predictors of gender differences. In other words, the gender gap does not decrease with age, which would have supported the maturational viewpoint whereby reading comprehension is also a function of maturity and, since boys mature at a later age, differences between boys' and girls' reading performance may decrease with increasing age. Likewise, there is no evidence to suggest that gender differences are more or less pronounced in countries where English is not the language of test administration.

Similarly, the effect of a study forming part of the Reading Comprehension or Reading Literacy assessment programs is not significant. Indeed, the small but negative coefficient for the Reading Comprehension Study (-0.07) illustrates that the design of this assessment program is more conducive to shifting the gender differences in reading performance in favour of boys, though not significantly so. The small positive effect for the Reading Literacy Study (0.04) indicates that gender differences in the Reading Literacy Study are nearly indistinguishable from those in the other studies.

The non-significant effect of the dummy variable "ESMEAN", which indicates whether or not the effect size was calculated on the basis of mean differences, reveals that the different measurement units on which the effect sizes are based do not contribute to the between-study variance in the meta-analysis. This finding suggests that no differences arise from the way in which effect sizes are calculated, which, in turn, provides empirical evidence in support of Cohen's idea of a "universal ES index".

In order to arrive at a final hierarchical model in this meta-analysis of gender differences, the between-study variables that did not contribute significantly to explaining differences in effect size were removed, with the least important one, namely age, removed first, followed by English, Reading Literacy, the effect size estimate based on mean differences, and Reading Comprehension. Only variables with significant effects were retained in the final model; the results are shown in Table 5.

Significant positive effects emerge for "PISA", "NAEP" and "AUS", with the PISA effect being the largest, indicating that gender differences are most pronounced in the PISA-2000 assessment program, followed by the Australian assessment programs, in particular the more recent studies, and the NAEP program. This means that in these assessment programs girls outperform boys to a greater extent when compared with all other studies. It is interesting to note that these results point to greater gender differences for assessment programs that have been undertaken more recently.

Table 5: Final Estimation of Fixed Effects: Final 'v-Known' Model (only significant predictors retained)

Final estimation of fixed effects:

Fixed Effect              Coefficient   Standard Error   t-ratio   Approx. df   P-value
For EFFSIZE, B1
  INTRCPT2,  G10          0.067018      0.015112          4.435    135          0.000
  PISA,      G11          0.245757      0.022271         11.035    135          0.000
  NAEP,      G12          0.195204      0.043425          4.495    135          0.000
  AUS,       G13          0.198029      0.029702          6.667    135          0.000

Final estimation of variance components:

Random Effect    Standard Deviation   Variance Component   df    Chi-square    P-value
EFFSIZE, U1      0.09919              0.00984              135   1067.48866    0.000

Statistics for current covariance components model
Deviance = 781.986451
Number of estimated parameters = 2

A comparison of the deviance values allows an evaluation of the three models under review, namely the unconditional model, the model which includes all eight predictors, and the final model with three predictors. Thus, the deviance, which is highest for the unconditional model with a value of 933.077, is reduced to 846.683 for the second model. For the final model, in turn, the deviance is further reduced to 781.986, which indicates that this model provides the best fit to the data.

The variance estimates of the unconditional model (0.02427) and the final model (0.00984) can be used to calculate the proportion of variance explained in study results. Thus, the final v-known model explains 59.46% ((0.02427 - 0.00984) / 0.02427) of the variability in results. Complementary information is provided by the chi-square (1067.489) and p-value (0.000) computed for the estimated variance of the effect parameters in the final model of τ̂ = 0.0098, which corresponds to a standard deviation of 0.099 and indicates that important variability still exists in the effect sizes. Thus, while the between-study variables included in the final v-known model explain more than half of the differences in effect sizes, a considerable amount of variability remains to be explained by factors other than those included in this meta-analysis.
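The variance-explained calculation above can be verified directly from the two reported variance components:

```python
# Proportion of between-study variance explained by the final model,
# using the two variance components reported in Tables 3 and 5.
var_unconditional = 0.02427   # unconditional 'v-known' model (Table 3)
var_final = 0.00984           # final model with three predictors (Table 5)

explained = (var_unconditional - var_final) / var_unconditional
print(f"{explained:.2%}")  # 59.46%
```

The same two quantities also give the residual standard deviation of 0.099 reported for the final model (the square root of 0.00984).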

Conclusion

In this study, a meta-analysis of large-scale studies between 1970 and 2002 in the area of reading achievement at the secondary school level, with a focus on gender differences, was conducted. The meta-analysis was conceptualized as a special application of a two-level hierarchical linear model whereby, in a first step, it was examined whether the effect sizes differed more than could be expected due to sampling error. Once results had been ascertained to be sufficiently heterogeneous, the analysis proceeded to examining study characteristics at Level-2 and the way in which they might be able to explain differences between effect sizes at Level-1. Level-2 variables included in the hierarchical linear model covered whether or not a particular study in the meta-analysis formed part of one of the large-scale assessment programs, the basis of calculation for the effect size, the age of study participants, and whether or not a study was conducted in a country where English was the language of test administration.

Results indicated that (a) gender differences existed across the 139 studies under review that were not due to chance; and (b) slightly over half of these differences could be explained by differences in the design of some of the large-scale assessment programs included in the meta-analysis.

Thus, the gender gap in favour of girls was even more pronounced for the assessment programs conducted by the National Assessment of Educational Progress in the United States, for the more recent assessment programs in Australia, and for the Programme for International Student Assessment conducted by the OECD. In addition, variables such as age or whether or not English was the language of test administration did not emerge as factors that had an impact on gender differences.

With respect to possible differences stemming from the different bases of calculating effect sizes (i.e., mean differences or correlation coefficients), results showed that such an impact was not significantly different from zero. This finding provided supportive evidence for Cohen's idea of a "universal ES index".


While the variables included in the final model explained more than half the variability in effect sizes between studies, a substantial amount of variance remained to be explained by factors other than those included in the analysis.

With a view to further research, it would be of interest to examine whether these results still hold if studies of narrower scope are included in the meta-analysis. In addition, a careful analysis of the differences in the definitions and operationalisation of reading literacy could illuminate reasons for the greater or smaller gender differences in the various studies. Finally, the reasons for the greater gender differences in more recent assessment programs, which could, for example, be related to item selection procedures, contextual changes surrounding reading in society and at school, or the scaling of reading scores, warrant further scrutiny.

Note

1. The analyses for this study were conducted while the author was a visiting scholar at the Flinders Institute of International Education in Adelaide, South Australia.

References

* indicates reports from which data were used in the meta-analysis.

Adams R., & Wu M. (Eds.). (2002). Programme for international student assessment (PISA): PISA2000 technical report. Paris: OECD.

Bangert-Drowns, R.L., Hurley, M., & Wilkinson, B. (2004). The effects of school-based writing-to-learn interventions on academic achievement: A meta-analysis. Review of Educational Research, 74(1), 29-58.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences, (2nd ed.), Hillsdale, NJ:Erlbaum,

Cohen, P.A., Kulik, J.A., & Kulik, C.-L. C. (1982). Educational outcomes of tutoring: A meta-analysis of findings. American Educational Research Journal, 19(2), 237-248.

Cook, T.D., Cooper, H., Cordray, D. S., Hartmann, H., Hedges, L.V., Light, R.J., Louis, T.A., &Mosteller, F. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.

Dedze, I. (1995). Reading achievement within the educational system of Latvia: Results from theIEA Reading Literacy Study. Paper presented at the annual meeting of the American Educational ResearchAssociation, San Francisco, CA, April 18-22.

deLemos, M.M. (1997). Reading standards up or down - What do the test norms say? Paper presented at the annual conference of the Australian Association for Research in Education, Brisbane. http://www.aare.edu.au/97pap/delem483.htm Retrieved: 04/06/05.

Department of Education and Training, WA (2003). Student achievement in English in Western Australian government schools. Reading 1992-2002, Writing 1999-2002. East Perth, WA. http://www.eddept.wa.edu.au/mse/reporting.html#pdfs Retrieved: 18/05/05.


*Elley, W.B. (Ed.). (1994). The IEA study of reading literacy: Achievement and instruction in thirty-two school systems. Oxford: Pergamon.

*Fuller, B., Hua, H., & Snyder, C.W. Jr. (1994). Focus on gender and academic achievement. When girls learn more than boys: The influence of time in school and pedagogy in Botswana. Comparative Education Review, 38(3), 347-376.

*Gambell, T., & Hunter, D. (2000). Surveying gender differences in Canadian school literacy. Journal of Curriculum Studies, 32(5), 689-719.

Glass, G.V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5, 3-8.

Glass, G.V. (1977). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-379.

Glass, G.V., McGaw, B., & Smith, M.L. (1981). Meta-analysis in social research. Beverly Hills: Sage.

*Glossop, J.A., Appleyard, R., & Roberts, C. (1979). Achievement relative to a measure of general intelligence. British Journal of Educational Psychology, 49, 249-257.

*Gorman, T.P., White, J., Orchard, L., & Tate, A. (1982). Language performance in schools. Secondary survey report no. 1. Department of Education and Science, Welsh Office, Department of Education for Northern Ireland. London: Her Majesty's Stationery Office.

Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.

Hedges, L.V. (1994). Fixed effect models. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 285-299). New York: Russell Sage Foundation.

*Hedges, L.V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41-45.

Hedges, L.V., & Olkin, I. (1983). Regression models in research synthesis. The American Statistician, 37(2), 137-140.

*Hogrebe, M.C., Nist, S.L., & Newman, I. (1985). Are there gender differences in reading achievement? An investigation using the high school and beyond data. Journal of Educational Psychology, 77(6), 716-724.

Hox, J.J. (1995). Applied multilevel analysis. Amsterdam: TT-Publikaties.

Hunter, J.E., & Schmidt, F.L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.

Hunter, J.E., Schmidt, F.L., & Jackson, G.B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.


*Johnson, D.D. (1973-1974). Sex differences in reading across cultures. Reading Research Quarterly, 9(1), 67-86.

Keppel, G. (1991). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.

Knickerbocker, J.L. (1989). Sex differences in reading achievement: A review of recent research. Ohio Reading Teacher, 24(2), 33-42.

Kulik, J.A., & Kulik, C.-L.C. (1989). Meta-analysis in education. International Journal of Educational Research, 221-340.

Kulik, J.A., Bangert, R.L., & Williams, G.W. (1983). Effects of computer-based teaching on secondary school students. Journal of Educational Psychology, 75(1), 19-26.

*Levine, D.U., & Ornstein, A.C. (1983). Sex differences in ability and achievement. Journal of Research and Development in Education, 16(2), 66-72.

Lietz, P. (1996). Reading comprehension across cultures and over time. Münster/New York: Waxmann.

Light, R.J., & Smith, P.V. (1971). Accumulating evidence: Procedures for resolving contradictions among different research studies. Harvard Educational Review, 41(4), 429-471.

Marks, G.N., & Ainley, J. (1996). Reading comprehension and numeracy among junior secondary school students in Australia. Longitudinal Surveys of Australian Youth, Research Report Number 3. Melbourne: Australian Council for Educational Research.

Neuman, S.B. (1986). Television viewing and reading: A research synthesis. Paper presented at the International Television Studies Conference, London, England, July 10-12, 1986.

*Neuman, S.B., & Prowda, P. (1982). Television viewing and reading achievement. Journal of Reading, 666-670.

*Oakland, T., & Stern, W. (1989). Variables associated with reading and math achievement among a heterogeneous group of students. The Journal of School Psychology, 27, 127-140.

Pedhazur, E.J. (1982). Multiple regression in behavioural research: Explanation and prediction (2nd ed.). Fort Worth, TX: Holt, Rinehart & Winston.

*Plisko, V.W. (2003). The release of the National Assessment of Educational Progress (NAEP): The nation's report card: Reading and mathematics 2003. http://nces.ed.gov/commissioner/remarks2003/11_13_2003.asp Retrieved: 18/05/05.

Preston, R.C. (1962). Reading achievement of German and American children. School and Society,90, 350-354.

Raudenbush, S.W., & Bryk, A.S. (1985). Empirical Bayes meta-analysis. Journal of Educational Statistics, 10(2), 75-98.


Raudenbush, S.W., & Bryk, A.S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.

Raudenbush, S.W., Bryk, A.S., Cheong, Y.F., & Congdon, R. (2001). HLM 5: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: SSI Scientific Software International.

Rosenthal, R. (1984). Meta-analytic procedures for social research. Newbury Park, CA: Sage Publications.

*Rothman, S. (2002). Longitudinal surveys of Australian youth. Research report number 29: Achievement in literacy and numeracy by Australian 14-year-olds, 1975-1998. Hawthorne, Vic.: Australian Council for Educational Research. Report available from http://www.acer.edu.au/research/lsay/reports/LSAY29.pdf. Retrieved: 13/07/04.

Saito, M. (1998). Gender vs. socio-economic status and school location: Differences in Grade 6 reading literacy in five African countries. Studies in Educational Evaluation, 24(3), 249-261.

*Shilling, F., & Lynch, P.D. (1985). Father versus mother custody and academic achievement of eighth grade children. Journal of Research and Development in Education, 18(2), 7-11.

Slavin, R.E. (1984). Meta-analysis in education: How has it been used? Educational Researcher, 13,6-15.

Slavin, R.E. (1986). Best-evidence synthesis: An alternative to meta-analysis and traditional reviews. Educational Researcher, 15, 5-11.

Stahl, S.A., & Miller, P.D. (1988). The language experience approach for beginning reading: A quantitative research synthesis. Washington, DC: United States Office of Education Report.

Thorndike, E.L. (1917). Reading as reasoning: A study of mistakes in paragraph reading. Reprinted in Reading Research Quarterly, 1971, 6(4), 425-434.

Thorndike, R.L. (1973). Reading comprehension education in fifteen countries. International Studies in Evaluation III. International Association for the Evaluation of Educational Achievement. Stockholm: Almqvist and Wiksell.

*U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2003, 2002, 1998, 1994 and 1992 Reading Assessments. NAEP Data Tool v3.0, http://nces.ed.gov/nationsreportcard/naepdata/getdata.asp, search options "subject"=reading; "Grade"=Grade 8 and Grade 12; "State/Jurisdiction"=Nation; Category=Major reporting groups; after pressing "continue" select "gender". Retrieved: 06/07/04 and 06/06/05.

Wagemaker, H., Taube, K., Munck, I., Kontogiannopoulou-Polydorides, G., & Martin, M. (1996). Are girls better readers? Gender differences in reading literacy in 32 countries. Amsterdam: International Association for the Evaluation of Educational Achievement.

*Youngman, M.B. (1980). Some determinants of early secondary school performance. British Journal of Educational Psychology, 50, 43-52.


The Author

PETRA LIETZ is assistant professor of quantitative research methods at International University Bremen, Germany, where she lectures in the logic of comparative research and secondary data analysis. Her main areas of research interest and publication are internationally comparative educational research, multivariate and multilevel modeling, ESL/EFL and meta-analysis.

Correspondence: <[email protected]>


Appendix: Calculation of Effect Sizes for Studies in the Meta-Analysis

1. For the Australian studies (reported by Rothman, 2002)

Reported SD of 10. Therefore:

d = (XF − XM) / 10

where XF and XM denote the mean scores of females and males, respectively.

2. For studies reporting means and standard deviations for males and for females and the number of cases for each sex (e.g., WA monitoring studies; Hogrebe et al., 1985; Johnson, 1973-74)

d = (XF − XM) / √( [(NF − 1)·sF² + (NM − 1)·sM²] / (NF + NM − 2) )

which is the mean for females minus the mean for males, divided by the within-group (also called "pooled") standard deviation (see Hunter et al., 1982, p. 98).

The reason for using the within-group standard deviation instead of the control-group standard deviation is that the within-group standard deviation has only about half the sampling error of the control-group standard deviation.

In addition, Cohen (1988, p. 11) states that "…the ES index for differences between population means is standardized by division by the common within-population standard deviation."

The reason for subtracting the male mean from the female mean is that higher average reading performance is expected for females. As a consequence, positive effect sizes denote superior performance of females, whereas negative effect sizes denote superior performance of males.
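As an illustration only, the pooled-SD calculation above can be sketched in Python. The function names and the example values are invented for this sketch and do not come from the study:

```python
import math

def pooled_sd(n_f, s_f, n_m, s_m):
    """Within-group ("pooled") standard deviation of the two sexes."""
    return math.sqrt(((n_f - 1) * s_f ** 2 + (n_m - 1) * s_m ** 2)
                     / (n_f + n_m - 2))

def effect_size_d(mean_f, mean_m, n_f, s_f, n_m, s_m):
    """Female mean minus male mean, divided by the pooled SD,
    so that positive d denotes superior female performance."""
    return (mean_f, mean_m)[0] - mean_m if False else \
        (mean_f - mean_m) / pooled_sd(n_f, s_f, n_m, s_m)

# Invented example: girls mean 52.0, boys mean 50.0, both SD 10, n = 100 each
print(effect_size_d(52.0, 50.0, 100, 10.0, 100, 10.0))  # 0.2
```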

3. For the Botswana study (reported by Fuller, Hua, & Snyder, 1994)

Using t-test values and the number of cases to calculate the effect size.

First step: Calculate the correlation coefficient r from the t value (see Hunter et al., 1982, p. 98):

r = t / √(t² + N − 2)

Second step: Calculate the effect size d based on r:

d = r / √((1 − r²) · p · q)

where p is the proportion of females and q is the proportion of males in the sample. Note that Hunter et al. (1982, p. 98) state that in the case of equal sample sizes for the two groups "[…] for small correlations, this means d=2r […]".
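The two-step conversion above can be sketched in Python. This is a minimal illustration with invented inputs, not code from the study:

```python
import math

def r_from_t(t, n):
    """Correlation coefficient recovered from a t value and total N
    (conversion described by Hunter et al., 1982, p. 98)."""
    return t / math.sqrt(t ** 2 + n - 2)

def d_from_r(r, p, q):
    """Effect size d from r; p and q are the female and male proportions."""
    return r / math.sqrt((1 - r ** 2) * p * q)

# Invented example: t = 2.5 from a sample of N = 200 with equal group sizes
r = r_from_t(2.5, 200)
d = d_from_r(r, 0.5, 0.5)
# For small r with equal group sizes, d is close to 2r, as Hunter et al. note.
```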


4. For the UK study, which reported means for females and males plus the respective standard errors rather than the standard deviations

Cohen (1988, p. 6) states: "…one conventional means for assessing the reliability of a statistic is the standard error (SE) of the statistic. If we consider the arithmetic mean of a variable X (X̄), its reliability may be estimated by the standard error (SE) of the mean (SE(X̄)):"

First, obtain the SD from the SE:

SE(X̄) = √(SD² / n)

therefore

SE(X̄) = SD / √n

therefore

SD = SE(X̄) · √n

Then, replace SD by SE(X̄) · √n in the "ordinary" formula for d to calculate the effect size:

d = (XF − XM) / √( [(NF − 1)·NF·SEF² + (NM − 1)·NM·SEM²] / (NF + NM − 2) )
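A short Python sketch of this step, converting each standard error back to a standard deviation and then pooling as in the ordinary formula; the names and example values are invented for illustration:

```python
import math

def sd_from_se(se, n):
    """Recover the standard deviation from the standard error of the mean."""
    return se * math.sqrt(n)

def d_from_se(mean_f, mean_m, se_f, n_f, se_m, n_m):
    """Effect size d when only means and standard errors are reported:
    convert each SE back to an SD, then divide by the pooled SD."""
    s_f = sd_from_se(se_f, n_f)
    s_m = sd_from_se(se_m, n_m)
    pooled = math.sqrt(((n_f - 1) * s_f ** 2 + (n_m - 1) * s_m ** 2)
                       / (n_f + n_m - 2))
    return (mean_f - mean_m) / pooled

# Invented example: SE of 1.0 with n = 100 per group implies SD = 10
print(d_from_se(52.0, 50.0, 1.0, 100, 1.0, 100))  # 0.2
```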

5. For the NAEP studies: mean differences (taken directly from the website)

Reported SD of 50. Therefore:

d = (XF − XM) / 50

6. For the PISA 2000 studies

Achievement scores were scaled to a mean of 500 and a standard deviation of 100 (Adams & Wu, 2002). Therefore:

d = (XF − XM) / 100

7. For studies reporting correlation coefficients (includes the Reading Comprehension Study)


d = r / √((1 − r²) · p · q)

where p is the proportion of females and q the proportion of males in the sample.

8. For studies reporting partial correlation coefficients, regression weights, or gammas

These are considered a "purer" estimate of the relationship between gender and reading achievement, as the effect of other variables has been partialled out. In other words, these measures provide information on the strength of the relationship between gender and reading achievement after the influence of other variables on the relationship has been taken into account. In line with this argument, betas or gammas of the most complex models were used as a basis for calculating the effect size, as these were considered to be the "purest" estimates of the relationship between gender and reading achievement, taking into account the largest number of other variables.

In line with Pedhazur (1982), regression coefficients can be considered similar in nature to correlation coefficients. Hence, the same formula as for correlation coefficients was used in the calculation of effect sizes from partial correlations, regression coefficients and gammas (from hierarchical linear models).

9. For studies reporting sums of squares resulting from ANOVA analyses (Oakland & Stern, 1989)

The idea that it is legitimate to use the following formula in calculating the effect size based on sums of squares has been put forward by Keppel (1991, pp. 437-444).

d = SS_Sex / SS_Total

Like partial correlation or regression coefficients, this measure is considered to be "purer" as it takes into account other variables, such as race and SES in the analysis by Oakland and Stern (1989).

10. For the Reading Literacy Study

According to Cohen's formula (1988, p. 20):

d = (XF − XM) / SD

whereby the values for the female and male means were taken from Purves and Elley (in Elley, 1994, p. 106) and the pooled standard deviation (SD) for the overall reading score was taken from Elley and Schleicher (in Elley, 1994, p. 57).