The Measurement Properties of the Teaching Strategies GOLD™ Assessment System:

Preliminary Results Following the Spring Assessment Checkpoint

Richard G. Lambert and Do-Hong Kim

Center for Educational Measurement and Evaluation, The University of North Carolina at Charlotte

June 2010

Sample

The total sample for the third phase of the Teaching Strategies GOLD™ assessment system field test included 2,465 children. The children in this nationally representative sample received preschool services in 40 different centers that are located across the United States. Most of these centers use The Creative Curriculum® for Preschool and had been assessing children by using The Creative Curriculum® Developmental Continuum for Ages 3–5 before this study. A total of 165 different raters (teachers) provided the ratings for the study. Each teacher received training in the use of the Teaching Strategies GOLD™ assessment system and rated an average of 15 children.

The children in the study ranged in age from 4 to 72 months. The percentages of children in the sample in each 6-month age-group are reported in Table 1. The sample was split almost exactly evenly between boys (50.1 percent) and girls (49.9 percent). English was the primary language spoken in the homes of 45.8 percent of the children. Spanish was the primary language spoken in the homes of 47.4 percent of the children. The remaining 6.8 percent of the children lived in homes where the primary language spoken was one of 29 other languages. Of all the children in the sample, 7.8 percent had disabilities. Children with an Individualized Family Service Plan (IFSP) comprised 1.4 percent of the sample, and children with an Individualized Education Program (IEP) comprised 6.4 percent of the sample.

The children spanned the entire age range for which the assessment system is intended (birth through kindergarten). It is important to note that the field test analyses discussed in this report were conducted with unweighted data. Children who are English-language learners were oversampled intentionally.

Validity

Factor Analysis

The first step of the validity analysis was exploratory factor analysis using Principal Axis Factoring and direct oblimin rotations. A five-factor solution accounted for 83.64 percent of the variance in item responses. The results of this solution are reported in Table 2. Simple structure was clearly achieved in this solution; no item had loadings of greater than .40 on more than one factor. In addition, four of the five factors exactly matched the intended domain of development according to the way the items were organized theoretically by the tool developers. Each of the items related to the Cognitive, Social–Emotional, Physical, and Language domains of development loaded on a factor with all of the other items within their respective domains. The only exceptions were the literacy and mathematics items. These items loaded together on one factor rather than two.
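The simple-structure criterion described above (no item loading above .40 on more than one factor) can be checked mechanically once a loading matrix is available. The following is a minimal sketch, not the procedure used in the field test; the function name `cross_loading_items` and the example loadings are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def cross_loading_items(loadings: Dict[str, List[float]], cutoff: float = 0.40) -> List[str]:
    """Return items whose absolute loading exceeds `cutoff` on more than one factor."""
    flagged = []
    for item, row in loadings.items():
        # Count the factors on which this item has a salient loading.
        salient = sum(1 for value in row if abs(value) > cutoff)
        if salient > 1:
            flagged.append(item)
    return flagged


# Hypothetical loadings (rows = items, columns = factors); not values from Table 2.
example = {
    "item_a": [0.82, 0.05, 0.10],
    "item_b": [0.12, 0.79, 0.03],
    "item_c": [0.45, 0.48, 0.02],  # salient on two factors, so it violates simple structure
}
flagged = cross_loading_items(example)
print(flagged)
```

An empty result corresponds to the situation reported here, in which every item loads saliently on at most one factor.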

Rasch Analysis

Data were also analyzed by using the Rasch Rating Scale Model (Andrich, 1978) and Winsteps software (Linacre, 2009). A separate Rasch analysis was conducted for each of the five domains of development identified in the factor analysis. Results of the Rasch principal components analysis of residuals (PCAR) showed that the variance in the data explained by the Rasch measures for each of the five scale scores ranged from 83.2 to 88.7 percent, and the largest secondary dimension accounted for only 2.5 to 5.2 percent of the unexplained variance. For PCAR, a variance of greater than 50 percent explained by measures is considered good support for scale unidimensionality. These results, which are reported in Table 3 by scale score, therefore clearly satisfy the Rasch model assumption of unidimensionality.
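For readers unfamiliar with the rating scale model, the sketch below computes category probabilities for a single person and item under the standard Andrich formulation, in which all items on a scale share one set of thresholds. It is an illustration only and does not reproduce the Winsteps estimation; the function name and the threshold values are hypothetical.

```python
import math
from typing import List, Sequence


def rsm_category_probabilities(theta: float, delta: float, taus: Sequence[float]) -> List[float]:
    """Probabilities of categories 0..len(taus) under the rating scale model.

    theta: person measure in logits
    delta: item location in logits
    taus:  thresholds tau_1..tau_m shared by all items on the scale
    """
    # Log-numerator for category k is the sum over j <= k of (theta - delta - tau_j);
    # category 0 has an empty sum, that is, 0.
    log_numerators = [0.0]
    for tau in taus:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical ordered thresholds for a 0-9 rating scale (nine steps); not estimates from the study.
taus = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
low = rsm_category_probabilities(theta=-6.0, delta=0.0, taus=taus)
high = rsm_category_probabilities(theta=6.0, delta=0.0, taus=taus)
most_likely_low = low.index(max(low))
most_likely_high = high.index(max(high))
print(most_likely_low, most_likely_high)
```

With ordered thresholds, a person far below the item is most likely to be rated in the lowest category and a person far above it in the highest category.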

Overall, the rating scale functioned effectively for each of the scale scores. Specifically, the average measure score increased with the category level. The thresholds advanced with the categories, indicating that category 0 is most likely to be observed for children who have relatively lower ratings on the items, whereas category 9 is most likely to be observed for children who have relatively higher ratings. The only exception to this finding was item 6 on the Physical scale, for which disordinality was found between ratings of 0 and 1. This item focuses on gross-motor development. Future research with the measure may focus on whether further refinement of either the behavioral anchors or training for ratings of 0 and 1 is warranted for this item. The full range of ratings, 0–9, was used by the teachers when rating this sample for all but two of the items. Only the range 0–7 was used for items 19a and 19b, which are more difficult items related to writing.
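The ordering check described above amounts to confirming that a sequence of category-level values increases from one rating to the next. The sketch below flags adjacent categories that are out of order; the function name and the average measures are hypothetical and are not taken from the study output.

```python
from typing import List, Sequence, Tuple


def disordered_steps(values: Sequence[float]) -> List[Tuple[int, int]]:
    """Return (k, k + 1) category pairs where the value fails to increase."""
    return [
        (k, k + 1)
        for k in range(len(values) - 1)
        if values[k + 1] <= values[k]
    ]


# Hypothetical average measures (logits) for rating categories 0..9 of one item.
# The value for category 1 is lower than the value for category 0.
average_measures = [-2.9, -3.2, -1.8, -0.9, 0.1, 1.0, 2.2, 3.1, 4.0, 5.2]
problems = disordered_steps(average_measures)
print(problems)
```

An empty list indicates that the categories advance in order, which is the pattern reported for every item except item 6 on the Physical scale.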

With very few exceptions, the fit statistics for the items were well within acceptable limits. Mean-square fit values between 0.6 and 1.4 are considered reasonable for rating scale items (Bond & Fox, 2007). For the Social–Emotional items, Infit mean-square values ranged from .77 to 1.13, and Outfit mean-square values ranged from .73 to 1.10. For the Physical items, Infit mean-square values ranged from .87 to 1.23, and Outfit mean-square values ranged from .79 to 1.12. For the Language items, Infit mean-square values ranged from .82 to 1.24, and Outfit mean-square values ranged from .83 to 1.23. For the Cognitive items, Infit mean-square values ranged from .81 to 1.31, and Outfit mean-square values ranged from .78 to 1.32. For the Literacy and Mathematics items, Infit mean-square values ranged from .71 to 1.58, and Outfit mean-square values ranged from .69 to 1.52. Only two items had fit statistics outside the suggested acceptable range: 16a (Identifies and names letters) and 19a (Writes name). These results suggest that it may be helpful during rater training to place greater emphasis on how to collect the appropriate artifacts to facilitate the determination of these ratings.
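As a quick check on the ranges reported above, the sketch below compares each domain's minimum and maximum mean-square values with the 0.6 to 1.4 guideline. The values are those stated in this section; the helper function `domains_outside_range` is a hypothetical name introduced for illustration, and the check works on the reported ranges rather than on item-level output.

```python
from typing import Dict, List, Tuple

FIT_LOW, FIT_HIGH = 0.6, 1.4  # guideline for rating scale items (Bond & Fox, 2007)

# (minimum, maximum) mean-square values by domain, as reported in the text.
reported_ranges: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Social–Emotional": {"infit": (0.77, 1.13), "outfit": (0.73, 1.10)},
    "Physical": {"infit": (0.87, 1.23), "outfit": (0.79, 1.12)},
    "Language": {"infit": (0.82, 1.24), "outfit": (0.83, 1.23)},
    "Cognitive": {"infit": (0.81, 1.31), "outfit": (0.78, 1.32)},
    "Literacy and Mathematics": {"infit": (0.71, 1.58), "outfit": (0.69, 1.52)},
}


def domains_outside_range(
    ranges: Dict[str, Dict[str, Tuple[float, float]]],
    low: float = FIT_LOW,
    high: float = FIT_HIGH,
) -> List[str]:
    """Return domains whose reported Infit or Outfit range extends past the guideline."""
    flagged = []
    for domain, stats in ranges.items():
        if any(minimum < low or maximum > high for minimum, maximum in stats.values()):
            flagged.append(domain)
    return flagged


outside = domains_outside_range(reported_ranges)
print(outside)
```

Only the Literacy and Mathematics scale is flagged, which is consistent with the two misfitting items (16a and 19a) both belonging to that scale.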

Item locations within each scale score analysis generally tended to match the distribution of person locations. The thresholds for the steps in the rating scale across items more than completely covered the full range of person locations, suggesting that each of the measures can be used to collect information that can discriminate between children at all levels of development.

Within each domain-specific Rasch analysis, the item location hierarchy appeared to be consistent with the expected developmental trajectory for children who are developing typically. These results can be interpreted as strong construct validity evidence for the scale scores. The rating scale scores indicate that the tool’s developmental indicators are presented in the sequence that matches the progressions of development and learning that the children in the sample are in fact following. In addition, each of the scale scores was moderately highly correlated with child age in months (r = .704 to .740). These results suggest that, while there is some expected variability in developmental levels for children of the same age, older children tend to receive higher ratings than younger children across all domains.
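The age correlations above are Pearson product-moment coefficients. A minimal sketch of the calculation follows, assuming paired lists of ages and scale scores; the data are hypothetical and are not drawn from the field test.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical ages in months and scale scores for six children.
ages = [12, 24, 36, 48, 60, 72]
scores = [1.0, 2.5, 2.0, 4.0, 3.5, 5.0]
r = pearson_r(ages, scores)
print(round(r, 3))
```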


Reliability

The internal consistency reliability for each of the scale scores was high, with Cronbach’s alpha coefficients ranging from .961 to .986. These results, along with the Rasch reliability indexes, are reported in Table 4. Based on the Rasch reliability indexes, each of the five scales also appears to be highly reliable, as evidenced by person separation indexes of 3.330–7.270; person reliabilities of .920–.980; item separation indexes of 22.110–40.300; and item reliabilities of nearly 1.000 for all scales.
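Separation and reliability are two expressions of the same information in Rasch measurement: reliability equals G² / (1 + G²), where G is the separation index. The sketch below applies this standard relation to the person separation indexes in Table 4 and compares the result with the reported person reliabilities. The function name is a hypothetical choice for illustration.

```python
from typing import Dict


def reliability_from_separation(separation: float) -> float:
    """Convert a Rasch separation index G into a reliability: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Person separation indexes and person reliabilities as reported in Table 4.
person_separation: Dict[str, float] = {
    "Social–Emotional": 5.31,
    "Physical": 3.33,
    "Language": 5.63,
    "Cognitive": 6.72,
    "Literacy and Mathematics": 7.27,
}
reported_person_reliability: Dict[str, float] = {
    "Social–Emotional": 0.97,
    "Physical": 0.92,
    "Language": 0.97,
    "Cognitive": 0.98,
    "Literacy and Mathematics": 0.98,
}

computed = {
    scale: round(reliability_from_separation(g), 2)
    for scale, g in person_separation.items()
}
for scale, value in computed.items():
    print(scale, value, reported_person_reliability[scale])
```

To two decimal places, the computed values agree with the person reliabilities reported for all five scales.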

Summary

Reliability statistics indicate that the Teaching Strategies GOLD™ assessment system is highly reliable. Factor analysis shows that the ratings load onto the constructs as intended by the tool-development team. Analyses of the dimensionality of each scale score suggest that the Teaching Strategies GOLD™ assessment system ratings measure five distinct domains of development and that each satisfies the Rasch model assumption of unidimensionality. The fit statistics suggest that the data are a good fit for the Rasch rating scale model. These results also strongly suggest that teachers are able to make valid ratings of the developmental progress of children across the intended age range, from birth through kindergarten.

Table 1 Participating Children by Age

Age (Months)   n     %
0–6            6     0.24
7–12           35    1.42
13–18          46    1.87
19–24          63    2.56
25–30          55    2.23
31–36          72    2.92
37–42          91    3.69
43–48          332   13.47
49–54          434   17.61
55–60          546   22.15
61–66          644   26.13
67–72          141   5.72
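The percentages in Table 1 follow directly from the counts and the total sample size. The short sketch below recomputes them as a consistency check, using only the counts reported in the table.

```python
from typing import Dict

# Number of children in each 6-month age band, as reported in Table 1.
counts: Dict[str, int] = {
    "0–6": 6, "7–12": 35, "13–18": 46, "19–24": 63,
    "25–30": 55, "31–36": 72, "37–42": 91, "43–48": 332,
    "49–54": 434, "55–60": 546, "61–66": 644, "67–72": 141,
}

total = sum(counts.values())  # should equal the reported sample size of 2,465
percentages = {band: round(100.0 * n / total, 2) for band, n in counts.items()}

for band, pct in percentages.items():
    print(f"{band:>6}  {counts[band]:>4}  {pct:6.2f}")
```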


Table 2 Structure Coefficients From Exploratory Factor Analysis

Domain of Development

Item   Cognitive   Literacy and Math   Social–Emotional   Physical   Language
1a                                     0.822
1b                                     0.833
1c                                     0.641
2a                                     0.630
2b                                     0.795
2c                                     0.865
2d                                     0.800
3a                                     0.872
3b                                     0.690
4                                                         0.484
5                                                         0.491
6                                                         0.413
7a                                                        0.481
7b                                                        0.418
8a                                                                   0.609
8b                                                                   0.469
9a                                                                   0.813
9b                                                                   0.863
9c                                                                   0.903
9d                                                                   0.766
10a                                                                  0.865
10b                                                                  0.646
11a    0.810
11b    0.893
11c    0.810
11d    0.899
11e    0.842
12a    0.700
12b    0.711
13     0.595
14a    0.652
14b    0.521
15a                0.654
15b                0.751
15c                0.809
16a                0.901
16b                0.994
17a                0.577
17b                0.765
18a                0.554
18b                0.675
18c                0.577
19a                0.524
19b                0.669
20a                0.649
20b                0.707
20c                0.835
21a                0.450
21b                0.573
22                 0.633
23                 0.602

Table 3 Rasch Principal Components Analysis of Residuals

Percent Variance Explained

Scale                      Rasch Rating Scale   Largest Secondary Dimension
Social–Emotional           85.9                 3.5
Physical                   87.3                 5.2
Language                   88.7                 2.7
Cognitive                  87.5                 2.6
Literacy and Mathematics   83.2                 2.5


Table 4 Reliability Evidence and Correlation With Age by Scale Score

Scale                      Number     Cronbach’s   Person       Person        Item         Item          Correlation With
                           of Items   Alpha        Separation   Reliability   Separation   Reliability   Age in Months
                                                   Index                      Index
Social–Emotional           9          0.975        5.310        0.970         30.110       0.999         0.704
Physical                   5          0.961        3.330        0.920         22.110       0.999         0.740
Language                   8          0.976        5.630        0.970         23.530       0.999         0.718
Cognitive                  10         0.983        6.720        0.980         22.130       0.999         0.730
Literacy and Mathematics   19         0.986        7.270        0.980         40.300       0.999         0.720
