A Rasch Item Response Modeling Approach to Validation:
Evidence Based on Test Content and Internal Structure of the
Life Effectiveness Questionnaire
By
Kara Lee Sammet
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Education
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Mark R. Wilson, Chair
Professor John Hurst
Professor Samuel R. Lucas
Fall 2012
A Rasch Item Response Modeling Approach to Validation: Evidence Based on Test Content and Internal Structure of the
Life Effectiveness Questionnaire
Copyright 2012 by
Kara L. Sammet
Abstract
A Rasch Item Response Modeling Approach to Validation:
Evidence Based on Test Content and Internal Structure of the
Life Effectiveness Questionnaire
By
Kara Lee Sammet
Doctor of Philosophy in Education
University of California, Berkeley
Professor Mark Wilson, Chair
This study applies the Standards for Educational and Psychological Testing to appraise validity evidence for the interpretation of test scores that are used in experiential education, adventure education, and out-of-school time programs. Specifically, Rasch item response modeling is applied as a primary tool to examine validity evidence for the Life Effectiveness Questionnaire (LEQ). Hypotheses and assumptions about test content and test structure are evaluated using technically calibrated responses to items as empirical evidence. This evidence, produced in the form of Wright maps, allows for a combined focus on the qualitative, substantive meaning of the test scores, as well as their technical, statistical properties. The partial credit model fit better than the rating scale model for the composite unidimensional and multidimensional models of the LEQ (p < 0.05), and for seven of the eight consecutive unidimensional models of LEQ subscales (p < 0.05). DIF analyses found no meaningful statistical evidence of construct-irrelevant variance by gender, age, or voluntary status. Analyses of person-to-item distributions found a mismatch between the intended meaning of item content and empirical item difficulties, so that the test developers' initial hypothesis about item content was not supported. Evidence from Wright maps indicates no order or sequencing of LEQ test content, and also that items oversample from one easy level of the Life Effectiveness continuum. These findings suggest that item content does not adequately represent the Life Effectiveness construct. Evidence from Wright maps indicates that respondents did not interpret and differentiate between response categories equivalently within items or across subdomains, so that the assumptions of interval-level Likert scaling that are the basis of the averaged scores for the LEQ are not supported. Wright maps provided evidence of good respondent-to-item threshold distribution for the composite LEQ. However, item thresholds are skewed to the lower levels of life effectiveness, so that standard errors are higher for respondents with higher ability estimates. Despite reasonable reliability coefficients, sizable standard errors and large confidence intervals associated with the LEQ subscales in particular, as well as with high-ability respondents in the composite LEQ, caution against over-confidence in using student ability estimates to make claims about change.
Acknowledgements

I am immensely grateful to and inspired by Mark Wilson, my committee chair, for his contention that measures must be both substantively meaningful and of the highest technical quality. I thank Mark for his considerable patience and helpful guidance as I found my way through this project. I thank John Hurst and Sam Lucas, my other committee members, for their ongoing support, despite (or because of) the shift in my methods, and for their insistence that I use a critical, sociocultural lens to question assumptions of test development and test use. My deepest gratitude to James Neill for sharing his data with me, for many thoughtful discussions about the development and use of the LEQ and, most of all, for supporting a thorough IRT analysis and frank reporting of results, even if findings did not support his own work. James' intellectual integrity and generosity are unparalleled, and I cannot thank him enough for encouraging my research. Many, many heartfelt thanks to Rachael Jin Bee Tan (who read and commented on drafts of this paper) and Kaysie Dannemiller (who listened to me talk about many drafts of this paper), who were there for me over the long haul, providing me with help and hugs exactly when I needed them most. Thank you so much to my colleagues from the Berkeley Evaluation and Assessment Research (BEAR) Center, especially Karen Draney and Steve Moore, who probably endured more than their fair share of my questions about the Rasch model, instrument development, and life in general. Many thanks to John Gargani (http://gcoinc.com/), evaluator extraordinaire, who encouraged me to consider both the practical and technical applications of measurement models for evaluation purposes, and who was a great resource for my statistical and evaluation questions. Thank you to my colleagues at the Institute for the Study of Social Issues, especially Jon Stiles, Frank Neuhauser, Nora Broege, Colleen Henry, and Kristen Nelson, who created a wonderfully supportive community in which to work and write. Finally, I want to express my gratitude to my husband, Eric, whose immeasurable support over many years helped to make this accomplishment possible.
Table of Contents
Chapter 1: Introduction
Chapter 2: Theoretical Framework
    Validation of Instruments for Program Evaluation
    An Item Response Approach to Validation
    Validity Evidence Based on Test Content
    Validity Evidence Based on Internal Structure
Chapter 3: Methods
    The Life Effectiveness Questionnaire
    Data
    Data Analyses
Chapter 4: Results
    Section 1: An Interpretive Validity Argument for the Life Effectiveness Questionnaire
    Section 2: Validity Evidence Based on Instrument Content
    Section 3: Validity Evidence Based on Internal Structure
Chapter 5: Discussion
References
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
List of Tables
Table 1 Characteristics of Respondents
Table 2 Relationship between Dichotomous Logit Differences and Probabilities for the Rasch Model (adapted from Wilson, 2005a)
Table 3 Model Fit Criteria for UDRS vs. UDPC
Table 4 Rating Scale vs. Partial Credit Unidimensional Models: Item Fit and Step Misfit (in percentages)
Table 5 Model Fit Criteria for MDRS vs. MDPC
Table 6 Rating Scale vs. Partial Credit Multidimensional Models: Item Fit and Step Misfit
Table 7 Fit Criteria for UDRS vs. UDPC Models Using the Consecutive Approach
Table 8 Rating Scale vs. Partial Credit Unidimensional Models for the Consecutive Approach
Table 9 Comparing Model Fit Criteria for Three Analyses of LEQ Data
Table 10 Comparison of Reliabilities for Three Models of LEQ Data
Table 11 Multidimensional Correlation Matrix
Table 12 Average Measure Values for LEQ Data in Response Categories 1-8
Table 13 Item 11 (Achievement Motivation: "I try to get the best results when I do things.")
Table 14 Gender DIF by LEQ Subscale
Table 15 Gender DIF by LEQ Item per Subscale
Table 16 Voluntary Status DIF by LEQ Subscale
Table 17 Voluntary Status DIF by LEQ Item per Subscale
Table 18 Age Category DIF by LEQ Subscale
Table 19 Age Category DIF for LEQ Items by Subscale
List of Figures
Figure 1. Modeling life-effectiveness: composite and consecutive approaches
Figure 2. Consecutive approach to modeling life effectiveness
Figure 3. Multidimensional approach to modeling life-effectiveness
Figure 4. The LEQ-H items by factor and response options
Figure 5. Blueprint of LEQ factors, descriptions, and sample items
Figure 6. Wright map of the composite (UDPC) LEQ
Figure 7. Wright map of composite LEQ item thresholds
Figure 8. Wright map for Active Initiative
Figure 9. Wright map for Active Initiative item thresholds
Figure 10. Unidimensional rating scale (UDRS) weighted mean square fit statistics (MNSQ)
Figure 11. The standard error of measurement for the composite LEQ scale
Figure 13. Wright map of item thresholds and 95% confidence intervals for the AI subscale
Figure 14. Standard error of measurement for the Active Initiative subscale
Figure 15. Item Characteristic Curves for Item 11
Chapter 1: Introduction
Whether constructing or selecting a measure for program evaluation purposes, evaluators cannot make assumptions about the quality of an instrument under consideration. Rather, measurement quality must be assessed empirically. Notwithstanding the need for program evaluators to balance available resources, including fiscal restrictions and measurement expertise, the psychometric principle of validity is asserted to be "the most fundamental consideration in developing and evaluating tests," according to The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 1999, p. 9). Establishing validity requires developing a structured argument to evaluate the strength and plausibility of the intended interpretation and use of test scores from a measure (Cronbach, 1988; Kane, 2001). Validity arguments should be developed with multiple sources of evidence (AERA et al., 1999) via information accumulated from individual validation studies that vary in their focus (Kane, 2006).
The purpose of this study is to illustrate the application of the Standards (AERA et al., 1999) for gathering focused validity evidence for the interpretation of scores used to evaluate experiential and out-of-school time education programs. Specifically, Rasch item response modeling is applied as a primary tool to examine validity evidence based on the test content and internal structure of the Life Effectiveness Questionnaire (LEQ) (Neill, Marsh, & Richards, 2003; Neill, 2008). The LEQ contains 24 items measuring eight proposed dimensions of life effectiveness: achievement motivation, active initiative, emotional control, intellectual flexibility, self-confidence, social competence, task leadership, and time management (Neill, 2008). The LEQ, or a modified version of it, has been used worldwide as a program evaluation or research tool in personal-development and education programs, particularly in programs that utilize outdoor or adventure education (e.g., American Institutes for Research, 2005; Australia Outward Bound, 2010; Holmes, 1996; Purdie, Neill & Richards, 2002; Sibthorp & Arthur-Banning, 2004).
The current investigation of validity evidence assumes that the developmental process of creating the LEQ is complete, that the measure has been adopted for program evaluation purposes, and that the primary concern of the various stakeholders is the interpretation of scores. The validation effort herein therefore constitutes part of the appraisal stage of validation, in which "a neutral or even critical stance" is taken to explore "hidden assumptions and...alternative possible interpretations of the test scores" (Kane, 2006, p. 26). Confirmatory and congeneric factor analyses and multiple tests of internal consistency and sample invariance were previously utilized in the developmental stages of the validity argument for the LEQ (Neill, 2008; Wang et al., 2008). This study utilizes several extensions of the one-parameter Rasch item response model (Rasch, 1960/1980). Item response modeling is an approach to developing and analyzing instruments (e.g., surveys, tests, and personality inventories) that incorporates items into the measurement model, thereby changing the unit of measure of an instrument from the total score to the item. The basic premise of the model is that the probability of a person successfully answering an item is a function of the ability of the person and the difficulty of the item (Wright & Masters, 1982). When measuring non-cognitive skills and socio-psychological latent variables (Borsboom, 2008) such as those proposed by the LEQ, "items" are often statements about attitudes or beliefs which respondents can agree with (i.e., endorse) to a varying degree. Using item response modeling, the probability of a person endorsing an item is a function of the amount of the attribute that the person possesses and how easy it is to endorse the item.
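To make this premise concrete, the following minimal Python sketch (not part of the original study; the theta and delta values are purely illustrative) computes the dichotomous Rasch endorsement probability, which depends only on the difference between the person's location and the item's endorseability:

import math

def rasch_probability(theta, delta):
    """Rasch model: probability that a person at location theta
    endorses an item with endorseability (difficulty) delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# A person located exactly at an item's difficulty endorses it with
# p = 0.50; a person one logit above endorses it with p of about 0.73.
print(rasch_probability(0.0, 0.0))  # 0.5
print(rasch_probability(1.0, 0.0))  # ~0.731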
Consideration of the quantitative, mathematical properties of items and tests is a traditional component of evaluating validity evidence. Rasch analysis provides numerous technical advantages over classical test theory for modeling latent variables, including improved precision in assessing change of scores over time, interval scaling, conditional standard errors, and properties of fundamental measurement (Wright & Masters, 1982; Bond & Fox, 2007). Despite these advantages, the statistical properties of tests and items provide insufficient evidence for the valid interpretation and use of test scores. Validity also requires evidence of the substantive meaning that underlies the quantified attributes.
One of the principal benefits of Rasch models is their ability to support substantive, meaningful interpretations of data. Rasch analyses model latent variables directly via Wright maps. A Wright map, also sometimes referred to as an item map, is a "technically calibrated version of the construct" that visually depicts the probabilistic relationship between items and responses along a shared scale representing the latent variable (Wilson, 2005a). Like a geographic map, a Wright map enables a researcher to instantly "see" and develop an intuitive sense of the complex, qualitative relationships that are present in the empirical data (Stone, Wright & Stenner, 1999, p. 312). The Wright map allows us to use the technically calibrated responses to items as empirical evidence to test hypotheses about test content and test structure. This particular form of hypothesis testing focuses on the interpretable meaning of scores and not just their statistical properties. It is the combined focus on substantive meaning and technical properties that allows us to gather the strongest validity evidence for the interpretation and use of test scores.
When used together, Rasch item response models and the Standards (AERA et al., 1999) provide powerful opportunities for assessing validity evidence based on test content and internal structure. A strong validity argument that incorporates these and other sources of evidence is crucial to developing and selecting high-quality instruments for program evaluations. Without a strong validity argument, test score interpretations may be questionable, at best, and must be used with caution. An unfortunate consequence of a weak validity argument may be the misallocation of resources for improving program quality. In a worst-case scenario, such as when program viability is at stake and depends on the interpretation of test scores, the consequences of weak validity evidence may be even more profound. Interpretive validity arguments have long been associated with program evaluation (House, 1977, cited in Cronbach, 1988; House, 1980, cited in Kane, 1992). While tens of thousands of participants have taken the LEQ in the context of a wide variety of programs and their evaluations, this is the first known application of item response modeling and the framework of the Standards to assess the validity argument for the LEQ. Furthermore, it is the first study to illustrate the application of the Standards and also of item response modeling to any instrument used in outdoor and adventure education program evaluations. The following research applies current professional standards of validity and Rasch item response modeling to the LEQ instrument to assess evidence for the interpretation of scores in their use, particularly as these scores are used in program evaluations.
Chapter 2: Theoretical Framework

Validation of Instruments for Program Evaluation

Social service, health care, and out-of-school-time education programs face evolving evaluation demands from both public and private sectors (Government Performance and Results Act, 1993; National Research Council, 2002; United Way, 1996; W.K. Kellogg, 1998). Though the values of funders' accountability systems may be controversial, funding is increasingly predicated on quantitative assessments of program implementation, outcomes and impacts (Behrens & Kelly, 2008; Fixsen, Naoom, Blasé, Friedman & Wallace, 2005; Gass, 2005; Walker, Farley & Polin, 2012). With the economic viability of programs on the line, organizations are searching for instruments that can be used for program evaluations (Brosi, 2011; Kahn, Bronte-Tinkew, & Theokas, 2008). However, the quality of existing measures and of measures that are custom-created for individual programs remains a concern (Bandy, Burkhauser & Metz, 2009; Bond & Fox, 2007; Sibthorp, 2000; Yohalem & Wilson-Ahlstrom, 2007). Indeed, the systematic creation—or lack thereof—of quality outcome measures has been called the "Achilles heel of evaluation research on instructional innovation" (Raudenbush, 2005, p. 29). The Standards for Educational and Psychological Testing (AERA et al., 1999) provide a framework for evaluating evidence of the quality of tests, based on the psychometric principles of validity and reliability. The Standards have been published in six editions since 1954 by a Joint Committee of representative members from professional organizations that are deeply involved in the development, publication, and use of tests (AERA et al., 1999). NCME urges that it is "a professional imperative" for its members to attend to the Standards. Similarly, APA has adopted the Standards as its policy, and AERA states that the Standards "represent the current consensus among recognized professionals regarding expected measurement practice" and should be observed by all relevant stakeholders (AERA et al., 1999, p. viii).
Validity theory has undergone considerable debate over the last several decades, as have standards for assessing validity. Historically, validity has been conceptualized and delineated as being comprised of three distinct types: criterion, content, and construct validity (Cronbach & Meehl, 1955; Guion, 1980). This view of validity has also been criticized as "fragmented and incomplete" (Messick, 1989, 1995; Loevinger, 1957), because it enabled test developers and test users to selectively appraise one of these three types of validity and to proclaim that a test had been validated once and for all (Cronbach, 1988; Shepard, 1993). Over time, validity has evolved in the Standards so that it is now clearly defined therein as "a unitary concept" (AERA et al., 1999, p. 9) and "the degree to which all the accumulated evidence and theory supports the interpretations of test scores entailed by proposed uses of tests" (AERA et al., 1999, p. 9). All stages of test construction, documentation, and application are framed by this current conception of validity, as well as by concurrent concerns about reliability and fairness (AERA et al., 1999, p. 37).
There are two main parts to this definition of validity that are worth clarifying, due to continuing misapplications of validity theory in practice. First, despite historical classifications that linger, there are not different types of validity. Rather, validity is an overarching, unitary concept. Validity in this unified conception requires building a validation argument supported by multiple sources of evidence (Kane, 2006; AERA et al., 1999). In particular, the Standards establish that validity arguments require accumulating evidence based on five sources: (1) test content; (2) response processes; (3) internal structure; (4) relations to other variables; and (5) consequences of testing (AERA et al., 1999). Not every source of validity evidence is always possible or desirable to include in a validity argument; the required sources of validity evidence depend on the circumstances in which test scores will be used (AERA et al., 1999). Given the complexity and extent of data required for validity arguments, Shepard (1993) advocates prioritizing the "immediate, urgent questions" that "must be answered to defend a test's use" (p. 406). Kane (1992) further emphasizes that it is the "most questionable assumptions" and "weakest parts" of an interpretive argument that require the most consideration in developing validity evidence (p. 530).
Second, tests are not validated (AERA et al., 1999; Cronbach, 1971). Instead, it is the interpretations of test scores that are validated. "To interpret a test score," Kane (1992) asserts, "is to explain the meaning of the score and, thereby, to make at least some of the implications of the score clear" (p. 527). This interpretive stance is predicated on the idea that "validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (Messick, 1989, p. 13). There are a few "relatively divergent" (Lissitz, 2009) critiques of the unified definition of validity by theorists who would like to distinguish between types of content and construct validity (Borsboom, Mellenbergh & Van Heerden, 2004; Lissitz & Samuelsen, 2007), or who would exclude the consequences of testing (Popham, 1997) from the unified definition. There are also continuing attempts to refine the unified concept of validity (Cizek, 2012; Hood, 2009). Notwithstanding these ongoing theoretical debates, psychometricians and other professional developers of tests consider the Standards to be the authoritative source for defining and explicating validity as it relates to psychological, educational, and credentialing testing (Sireci & Parker, 2006). Additionally, the Program Evaluation Standards established by the Joint Committee on Standards for Educational Evaluation (1994) support the unified, interpretive definition of validity (p. 145) and urge application of the Standards (AERA et al., 1999) when selecting and developing tests for program evaluation.
In practice, these two main parts of the current validity framework established by the Standards mean that validity cannot be established for a given test once for all time. Rather, as Cronbach argued, "because psychological and educational tests influence who gets what in society, fresh challenges follow shifts in social power or social philosophy. So validation is never finished" (1988, p. 5). Furthermore, validity cannot be established solely by reliance on the correlational statistics and reliability coefficients that are so strongly emphasized in classical test theory models.
Nonetheless, discrepancies between principles and practices of validation remain (Messick, 1989; Shepard, 1993). In 1988 Cronbach asserted that "the 30-year-old idea of three types of validity, separate but maybe equal, is an idea whose time has gone" (p. 4). However, widely used texts for scale development, such as DeVellis (2003, 2011), and a review of measurement textbooks by Goodwin and Leech (2003) show that validity continues to be taught to developing professionals as three stand-alone, distinct types. Furthermore, recent reviews of test development in the psychology literature suggest that current conceptions of validity and validation procedures have gained little traction among test developers in this area (Bornstein, 2011; Cizek, Rosenberg & Koons, 2008; Hogan & Agnello, 2004; Slaney, Tkatchouk, Gabriel & Maraun, 2009). The entrenched validation practices that diverge so starkly from established principles point to the need to apply the Standards to the development of new instruments in general, as well as to existing measures already in use, such as the LEQ in the current study.

An Item Response Approach to Validation
The Standards are not prescriptive about which measurement model should be used to assess evidence for validity. Nonetheless, the measurement model that is applied to the construction and assessment of scales is central to the evaluation of inferences that can be made from test scores for use in program evaluations. The LEQ was developed using various forms of factor analysis, a family of statistical approaches to analyzing data from within the classical test theory measurement model (DeVellis, 2003). Despite its popularity as an analytical tool, factor analysis has been criticized as an overemphasized source of validity evidence, particularly for socio-psychological measures (Borsboom, 2006, 2008; Bornstein, 2011; Goodwin & Leech, 2003). In comparison with factor analysis and other methods based on classical test theory models, there are numerous technical advantages to using Rasch item response models for developing validity evidence. These advantages include interval scale properties; conditional standard errors to assess measurement precision along all levels of the latent variable; parameter separation, such that item difficulty and person ability can be estimated independently of each other (i.e., sample-free measures and item-free calibrations); and the ability to assess subgroup invariance by item (i.e., differential item functioning (DIF)) when the data fit the model (Bond & Fox, 2007; Embretson, 1996; Wilson, 2005a). Just as important as the Rasch model's potential for gathering statistical validity evidence, though, is the application of the model for empirically evaluating the theory underlying the substantive meaning of items and scales (Wilson, 2005a). Through the Rasch approach to measurement, technical aspects of the model and substantive theories about the latent variable are mutually informative, making measurement a much more qualitatively meaningful enterprise.
Item response models for non-cognitive variables. Although item response models are best known for their application in the development of cognitive tests for large-scale educational assessments, the advantages of IRMs are also applicable to the measurement of non-cognitive variables (Reise & Moore, 2012). Non-cognitive variables such as those that are theoretically represented in the LEQ are frequently the focus of socio-psychological, health behavior change, and non-traditional educational programs. IRM has been used to measure non-cognitive variables such as self and social development, including empathy, impulse control, friendship, and conflict negotiation (Karelitz, Parish, Yamada & Wilson, 2010); leadership development (Gnambs & Batinic, 2011); sport and exercise psychology (Tenenbaum, Strauss & Busch, 2007; Heesch, Mâsse & Dunn, 2006); and leisure and recreational experiences (Bagley, Gorton, Bjornson, Bevans, Stout, Narayan & Tucker, 2011; Beach, 1997; Meakins, Bundy & Gliner, 2005). IRM is also used to appraise the psychometric properties of existing personality measures (Egberink & Meijer, 2011; Gray-Little, Williams, Hancock, 1997; Morizot, Ainsworth & Reise, 2007), as well as to detect aberrant response styles and faking (i.e., deliberately false responses intended to mislead) in personality and organizational assessments (Eid & Zickar, 2007).
The LEQ was designed to be general enough to evaluate outcomes in a wide variety of personal development programs. However, the LEQ was most specifically created as a solution to a history of instrumentation problems in adventure education programs, which use outdoor experiences as a medium for promoting personal growth. The complex outcomes of adventure education programs have proved difficult to measure, whether with instruments borrowed from other disciplines or with instruments that are customized for a particular adventure education program but end up being poorly designed (Sibthorp, 2000).
Inadequate or inappropriate instrumentation may be an important source of the dissonance between qualitative and quantitative findings in adventure education program evaluations, a persistent problem in that field's literature. For instance, a meta-analysis of adventure education program evaluations (Hattie, Marsh, Neill & Richards, 1997) points to many studies in which evaluators did not find statistically significant pre- to post-course change for outcomes that were quantitatively measured, but claimed to find evidence of change based on qualitative methods.
Model fit. The classical true score model that was used to develop the LEQ assumes that an observed score (X) is composed of the respondent's "true score" (T) plus an overall "error" (E): X = T + E. Evaluation of validity evidence for the appropriateness of a particular mathematical model—that is, the investigation of a model's "fit"—involves a multi-step analytical process (Wilson, 2005a). Although tests of fit may be conducted to assess more or less restrictive assumptions for the classical true score model (DeVellis, 2003), the model cannot be formally rejected because it is one equation with two unknowns (Wilson, Allen & Li, 2006). In contrast, a fundamental assumption of item response modeling is that the mathematical model that is being applied is appropriate for the data. "An essential element to test validity," according to Mislevy (2009), "is whether, in a given application, using a given model provides a sound basis for organizing observations and guiding actions in the situations for which it is intended" (p. 83). Altogether, four item response models for the LEQ data are evaluated in this analysis: the rating scale (Andrich, 1978) and the partial credit (Masters, 1982) models for polytomous items are each applied in unidimensional (Adams & Wilson, 1996) and multidimensional (Adams, Wilson & Wang, 1997) form.
The statistical model underlying the estimation procedures for the four models is the multidimensional random coefficients multinomial logit model (MRCML) (Adams, Wilson and Wang, 1997). The MRCML is modeled as:
!(!!" = !;!,!, !|!) = !"#(!!"!!!!"! !)
!"# (!!"!!!!"! !)!!
!!! (1).
This equation describes the probability of response X in category k of item i; A represents the design matrix that specifies the model, and B the scoring submatrix (both of which are specified by the user); ξ represents a vector of item difficulty parameters for the entire instrument; and θ represents a vector of trait/ability parameters (i.e., the latent variables) on all dimensions in the instrument. Additional estimation specifications specific to the investigations are presented in Chapter 3 (Methods). Rasch models for polytomous items take many forms; the models specified in the equations below represent the most basic forms of the generally specified MRCML model (Adams et al., 1997). For a more detailed discussion of the equations that can be derived for rating scale and partial credit models, see Adams, Wu and Wilson (2012).
The unidimensional Rasch partial credit model for polytomous data takes the following general form:
!"# !!"
!!"!!= ! − !!" (2)
with item i and ordered response categories k (e.g., from the LEQ, response options 1 through 8). The log odds of the probability that a participant's response to item i is in category k (P_{ik}), rather than in the previous category (P_{ik-1}), is modeled as a function of the respondent's location θ (e.g., competence, ability, or attitude) and the relative difficulty or endorseability δ_{ik} of category k of item i. The partial credit model does not assume equal distances between response categories but, instead, allows the pattern of item step parameters to vary across each item (Wright & Masters, 1982).
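As an illustration of Equation 2, the short Python sketch below (illustrative only; the step difficulties are hypothetical values, not LEQ estimates) computes the full set of category probabilities for one partial credit item. Constraining every item's steps to the common pattern δ_i + τ_k would yield the rating scale model introduced next:

import math

def pcm_category_probabilities(theta, step_difficulties):
    """Partial credit model category probabilities for one item.
    Categories run 0..m; step_difficulties holds delta_i1..delta_im
    (an LEQ item scored 1-8 would have seven steps)."""
    # The numerator for category k is exp of the cumulative sum of
    # (theta - delta_ij) for j <= k; the empty sum for k = 0 gives 1.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# Steps need not be equally spaced: each item may have its own pattern.
print(pcm_category_probabilities(0.5, [-1.2, 0.1, 1.4]))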
In contrast, the unidimensional Rasch rating scale model for polytomous data assumes that the step parameters (i.e., transitions from one response category to another) are equidistant and constrains the distance between response options across all items (Wright & Masters, 1982):
!"# !!"
!!"!!= ! − !! + !! (3)
where τ_k represents the step parameter for each of the k categories. Many scales, such as the LEQ, make the assumption that the Likert-style response categories function as equally spaced interval categories that can be averaged and meaningfully interpreted (van Alphen, Halfens, Hasman, & Imbos, 1994; DeVellis, 2006). Fitting the LEQ data to the partial credit and rating scale Rasch models provides an empirical test of this assumption, by both statistically estimating and graphically depicting whether respondents interpret the distance between response categories as constant. Showing how well the LEQ data fit the rating scale and partial credit models is therefore an important source of evidence for validity based on the internal structure of the instrument. The eight subscales of the LEQ were designed to be structurally distinct, yet intercorrelated (Neill, 2008). Statistically, then, the assumption of unidimensionality for the life effectiveness construct may not hold and, theoretically, a multidimensional model for LEQ data appears to be the most appropriate. It is no longer the case, as claimed in a recent text on scale development, that "if a set of items is multidimensional (as a factor analysis might reveal), then the separate, unidimensional item groups must be dealt with individually...under both classical and IRT approaches", yielding a set of distinct scales for each group of items (DeVellis, 2011, p. 159). Rather, a joint, multidimensional calibration of data is possible through the application of the MRCML (see Equation 1) (Adams, Wilson & Wang, 1997). The multidimensional approach to the Rasch model examines all the subscales simultaneously, providing disattenuated
dimensional correlations of latent attributes and improved reliability for each of the subscales (Briggs & Wilson, 2003).
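The multidimensional model estimates these disattenuated correlations directly as part of the calibration. For intuition only, the classical correction formula sketched below (a standard psychometric identity, not the MRCML estimation itself; the numbers are hypothetical) shows why correlations corrected for subscale unreliability exceed the raw observed correlations:

import math

def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Classical correction of an observed correlation between two
    fallible subscale scores for their unreliability."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Two short subscales with reliability 0.70 and an observed
# correlation of 0.49 imply a latent correlation of 0.70.
print(disattenuated_correlation(0.49, 0.70, 0.70))  # 0.7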
In the MRCML, each respondent has a θ_d value (ability estimate) for each dimension of life effectiveness that is measured by the LEQ: θ = (θ_1, ..., θ_D), where the dimensions are allowed to be non-orthogonal (i.e., correlated). For the generalized form of the multidimensional Rasch partial credit model:
!"# !!"!!"!!
= !! − !!" (4)
the log odds of the probability of a person's response in category k of item i (P_{ik}), compared with category k-1 (P_{ik-1}), is a linear function of the respondent's ability on that dimension, θ_d, and the relative difficulty of category k (δ_{ik}). The multidimensional rating scale model for polytomous data is similar but, like the unidimensional model, adds a constraint for the tau parameters of the items:
!"# !!"!!"!!
= !! − !! + !! (5).
Given the advantages of IRMs, as well as the history of using IRMs to develop and analyze non-cognitive variables, IRM is well suited to examining validity evidence for interpreting and using scores from the LEQ. Of the five sources of validity evidence that are articulated in the Standards (AERA et al., 1999), the Rasch family of item response models is particularly useful for assessing evidence based on test content and internal structure.
Validity Evidence Based on Test Content

Test content, according to the Standards, includes the "themes, wording, and format of the items, tasks, or questions on a test, as well as the guidelines for procedures regarding administration and scoring" (AERA et al., 1999, p. 11). Validity evidence based on test content is related to four central aspects of a test: (1) content specification, or definition of the content domain; (2) content representativeness; (3) content relevance; and (4) the procedures of test construction (Sireci, 1998a; AERA et al., 1999). There are two primary sources for evaluating evidence of test content. The first, most traditional, procedure is the elicitation of judgments by subject matter experts as to the relevance and representativeness of content to the specified domain (Sireci, 1998b; AERA et al., 1999).
The Standards also specify that empirical analyses can be used to evaluate test content (AERA et al., 1999). In particular, the Standards suggest that "test developers might provide a logical structure that maps the items on a test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represent the content domain" (AERA et al., 1999, p. 19). This map that specifies content areas and the items related to those areas is also referred to as a blueprint (Cronbach, 1971; Sireci, 1998a) or a construct map (Wilson, 2005a). The blueprint is the hypothesized map, or theoretical outline, of the variable that is the subject of the instrument. If a construct mapping approach (Wilson, 2005a) is used to create test content, instrument development involves intentionally connecting the meaning of item content and the location of items on a shared scale.
A principal strength of the Rasch model for gathering validity evidence based on instrument content is its ability to directly model hypothesized latent variables. The Wright map produced by the Rasch model provides interpretable, empirical evidence about the nature of the theorized construct as manifested in the relationship between person responses and item content (Wilson, 2005a). Rasch analysis allows us to progress from interpreting a theoretical blueprint of the LEQ construct to interpreting an empirical map of the construct that is based on self-report data from LEQ items. This allows us to test whether the data correspond with the intent of the content. Data that correspond with the blueprint are a strong source of validity evidence. If data do not align as intended, or if the content of the test otherwise "fails to capture important aspects of the construct," this indicates a potential threat to validity through construct underrepresentation (AERA et al., 1999, p. 10; Messick, 1989). Construct underrepresentation results in a "narrowed meaning of test scores," thereby weakening the argument for the use of the test (AERA et al., 1999, p. 10).
Validity Evidence Based on Internal Structure

According to the Standards, analyses of internal structure provide evidence about the degree to which "relationships among test items and test components conform to the construct on which the proposed test score interpretations are based" (AERA et al., 1999, p. 13). The Rasch family of item response models is well suited to examining these relationships. Test dimensionality is the focus of Standard 1.11, which establishes requirements for evidence related to the internal structure of tests (AERA et al., 1999). Standards 1.10 and 1.12 are related to the issue of dimensionality, and require a rationale and validity evidence for the interpretation and use of composite and subscale scores. In particular, the Standards suggest that if interpretation of scores is based on small subsets of items, users of such subtests should be provided with "guidance to enable them to judge the degree of confidence warranted" and "score reports should discourage overinterpretation of information that may be subject to considerable error" (AERA et al., 1999, p. 19).
These standards address the overlap between the way constructs are theoretically and statistically modeled, and how the constructs are scored and interpreted in practice. The length and content of the LEQ composite scale and subscales were deliberately designed to balance five measurement aims: assessment of life skill competencies, relevancy to program goals, practical field usability, sensitivity to change, and the value of the instrument as an educational tool (Neill, 2008, p. 123). In other words, the instrument was very thoughtfully and purposefully constructed to provide an optimum balance between parsimony and validity (Neill, 2008).
Applied use of measurement models. The interpretation and use of ability scores or competency estimates from the LEQ data can be conceptualized in terms of three approaches: composite, consecutive, and multidimensional (Adams, Wilson & Wang, 1997; Briggs & Wilson, 2003; Wilson, 2005b). In a composite approach, scores from multiple subscales are aggregated as if they represent a unidimensional construct, and are interpreted as a single score. The composite approach to the LEQ is modeled below in Figure 1.
Figure 1. Modeling life-effectiveness: composite and consecutive approaches (after Briggs & Wilson, 2003). [Figure legend: X_i = individual items; θ_LE = a single estimate of latent life effectiveness, to which all twenty-four items contribute.]
In a consecutive approach, each subscale is likewise treated as unidimensional, but a separate score is obtained and interpreted for each subscale, one after another, as modeled below in Figure 2.
Figure 2. Consecutive approach to modeling life effectiveness. [Figure legend: X_i = individual items; θ = eight independent estimates of latent life effectiveness, with three items for each dimension.]

A third approach is to conduct a multidimensional analysis of subscales, which models multiple dimensions as distinct yet interrelated, as shown for the LEQ in Figure 3. In practice, the LEQ is administered and analyzed as both a composite, 24-item unidimensional scale and as eight consecutive, unidimensional subscales, with three items for each subscale (Neill, 2007, 2008). The multidimensional approach is considered here as well for the statistical and substantive interpretational advantages it may provide with regard to validity evidence.
Figure 3. Multidimensional approach to modeling life-effectiveness. [Figure legend: X_i = individual items; θ = eight correlated estimates of latent life effectiveness.]
The three approaches to modeling a construct typically involve trade-offs in the interpretability and use of test scores. Reliability is usually increased through a composite approach, yet this comes at the cost of lost information about the distinct outcome areas and a decrease in the overall interpretability of the instrument. In practice, a single composite score for the LEQ may be created primarily for policy makers and funders, because it allows for easy comparison of global program effects among groups and programs. Conducting consecutive analyses of subscales improves the information available for each outcome area. There is potentially great practical benefit to the consecutive subscale approach because it allows each personal development program to tailor its evaluation instrumentation to the program's specific desired outcomes. That is, consecutive modeling allows researchers to pick and choose among the subscales, thereby customizing the LEQ to align with the unique objectives of each program. However, reliability estimates for each consecutive analysis tend to be lower than that of the composite model. The standard error for each estimate from the consecutive analysis will also usually be larger than that of the composite model, because fewer items define each dimension (e.g., each LEQ subscale has 3 items, rather than the 24 items of the entire LEQ); the illustration below makes this trade-off concrete. Conversely, longer subscales with higher reliability and lower standard errors would increase the time and cognitive burden on respondents. The primary anticipated drawback of using the subscales as independent scales is that this approach does not fully model the ways in which the subscales may be interrelated both statistically and theoretically (Allen & Wilson, 2006).
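As promised above, here is a small Python sketch of the reliability trade-off using the classical Spearman-Brown prophecy formula (a textbook identity offered purely for illustration with made-up numbers; the actual LEQ reliabilities come from the Rasch analyses reported in Chapter 4):

def spearman_brown(reliability, k):
    """Projected reliability when a scale is lengthened by a factor
    of k with comparable items (Spearman-Brown prophecy formula)."""
    return k * reliability / (1 + (k - 1) * reliability)

# If a 3-item subscale has reliability 0.60, a comparable 24-item
# composite (k = 8) projects to about 0.92, which is why the short
# consecutive subscales carry larger standard errors.
print(spearman_brown(0.60, 8))  # ~0.923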
Optimally, a multidimensional approach will provide more information about multiple latent constructs than is possible with a composite analysis, while also improving upon the reliability estimates that are produced by consecutive analyses. Although not currently applied in practice, the multidimensional approach to modeling the LEQ and deriving competency estimates is theoretically the most appropriate and is likely to be statistically supported.
Threats to validity: Analysis of differential item functioning. In addition to outlining sources of evidence for validity, the Standards identify construct-irrelevant variance as a primary threat to the valid interpretation and use of test scores (AERA et al., 1999). Construct-irrelevant variance occurs when test scores are "systematically influenced to some extent by components that are not part of the construct" (AERA et al., 1999, p. 10). Such variance can make an item or test format more difficult or easier for some individuals or groups (Messick, 1989; AERA et al., 1999). One empirical method of checking that item functioning is invariant across groups of respondents is to conduct a statistical test of differential item functioning (DIF). An item is said to display DIF when one group has a different probability of success on an item than another group, after controlling for the ability of interest (Clauser & Mazor, 1998; Holland & Wainer, 1993). For instance, an item can be said to exhibit gender DIF if, on average, the item is harder, or more difficult to endorse, for females of a particular ability (or level of a latent trait) than it is for males of the same ability. DIF should not be confused with overall group differences in mean levels of the latent trait, which are known as differential impact.
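To illustrate the logic of a DIF check (this is a generic Rasch-based comparison, not the specific procedure implemented in the software used here, and the estimates below are hypothetical), one common approach calibrates an item separately in two groups and standardizes the difference between its difficulty estimates:

import math

def dif_z(delta_group1, se1, delta_group2, se2):
    """Standardized difference between group-specific Rasch item
    difficulties; |z| well above 2 flags an item for DIF review."""
    return (delta_group1 - delta_group2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical estimates: the item is 0.30 logits harder to endorse
# for group 1 than for group 2 at the same level of the latent trait.
print(dif_z(0.15, 0.08, -0.15, 0.08))  # ~2.65, worth inspecting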
Among the potential causes of DIF are an item's content, cognitive demands, item type, item text, and visual-spatial or reference factors (Zenisky, Hambleton & Robin, 2004). Reviews of response processes, via cognitive interviews or think-alouds, may also yield information about the causes of DIF in an item (Johnstone, Thompson, Bottsford-Miller & Thurlow, 2008). Although the review process for trying to determine the causes of DIF is essential, it is also important to note that even expert review panels experience difficulty determining why an item functions differently for one group compared with another (AERA et al., 1999).
Organizations that use the LEQ for program evaluation purposes will likely be interested in knowing whether items function differently across categories of gender, race/ethnicity, socio-economic status (SES), age, volunteer enrollment status, and country of origin. Insufficient data were available to analyze the LEQ for race/ethnicity, SES, or country of origin, although each of these categories deserves analysis as the LEQ is further used internationally with diverse groups of respondents. In the current analyses, the LEQ is examined for DIF by gender, age, and volunteer status, using the partial credit unidimensional (UDPC) model for both the composite (24-item) LEQ scale and the eight LEQ subscales (3 items per scale).
Differential item functioning does not necessarily mean that an item is flawed or biased. Rather, DIF means that the item functions differently for subgroups of respondents and is likely measuring something that is irrelevant to the intended construct. Removal of items from tests, however, is neither always necessary nor without controversy. The practical consequences of retaining or removing an item with DIF depend, in part, on the context of an item's intended use. In high-stakes situations, for instance, items displaying even small amounts of DIF are argued to have serious consequences for the validity of the test scores and for all decisions that are based on the scores, including scholarships and admission to higher education (Santelices & Wilson, 2010). Since items displaying any sort of DIF in high-stakes situations can lead to an entire test being judged unfair or biased (AERA et al., 1999), they may be automatically removed without further consideration of the underlying cause of the DIF.
In other contexts, for instance if an item displays DIF in a behavioral or attitudinal survey, observed DIF could be revealing cultural differences that provide insight into tailoring interventions or teaching specific skills to a particular ethnic group (Watson, Baranowski, & Thompson, 2006). In this case, the items displaying DIF may be worth retaining. In yet other scenarios, an item may statistically display DIF, but the effect size—or the practical effect—of the DIF may be negligible or even moderate, and the item may be unique enough in content or difficulty that it is worth retaining (AERA et al., 1999; Wilson, 2005a). It has also been argued that the systematic removal of items displaying DIF is actually a discriminatory procedure that has the primary effect of preserving prior test-score distributions in which there is a black-white test score gap favoring whites (Lucas, 2008).
Chapter 3: Methods

The Life Effectiveness Questionnaire
The LEQ is structured so that it can be administered and analyzed either as a composite, 24-item unidimensional scale or as eight unidimensional subscales, with three items for each subscale. The eight subscales and their corresponding items, numbered 1 through 24, are displayed in Figure 4. The LEQ went through five developmental stages between 1986 and 2000, during which time domains and items were continuously re-evaluated, modified, or removed based on confirmatory factor analysis and substantive theoretical considerations (Neill, 2008). Neill (2008) utilized congeneric and confirmatory factor analyses to derive an 8-factor, 24-item instrument (LEQ-H) with a high Tucker-Lewis index of fit (TLI = 0.984; N = 1,892) from a previous 11-factor, 64-item instrument (LEQ-G). A global second-order model produced a high TLI (0.980) (Neill, 2008).
The LEQ data used in this study were collected under conditions in which respondents were given a single piece of paper with instructions and examples for completing the LEQ on one side, and LEQ items and response options on the other (Neill, 2008). The response option scales carried different written labels on the two sides of the paper. The items side included only the labels "FALSE not like me" and "TRUE like me" above opposite ends of the numbered 1-8 scale. The instructions side included the same False/True labels above the 1-8 response options, as well as more detailed labels under pairs of response categories (e.g., "This statement doesn't describe me at all; it isn't like me at all" under the numbers 1 and 2). (See Appendix A for a complete version of the LEQ instructions and items pages.) The effect of the different labeling strategies on respondents' interpretations of the response option categories is not discussed in Neill (2008). Analyses in the current chapter assume that respondents relied primarily on the more limited anchor labels, because these were in bold, large print at the top of the items page and immediately above the numbered response options that respondents were required to choose among for every LEQ item.
Time Management (TM)
01. I plan and use my time efficiently.
09. I do not waste time.
17. I manage the way I use my time well.

Social Competence (SO)
02. I am successful in social situations.
10. I am competent in social situations.
18. I communicate well with people.

Achievement Motivation (AM)
03. When working on a project, I do my best to get the details right.
11. I try to get the best results when I do things.
19. I try to do the best that I possibly can.

Intellectual Flexibility (IF)
04. I change my thinking or opinions easily if there is a better idea.
12. I am open to new ideas.
20. I am adaptable and flexible in my thinking and ideas.

Task Leadership (TL)
05. I can get people to work for me.
13. I am a good leader when a task needs to be done.
21. As a leader I motivate other people well when a task needs to be done.

Emotional Control (EC)
06. I can stay calm in stressful situations.
14. I stay calm and overcome anxiety in new or changing situations.
22. I stay calm when things go wrong.

Active Initiative (AI)
07. I like to be busy and actively involved in things.
15. I like to be active and energetic.
23. I like to be an active 'get into it' person.

Self Confidence (SC)
08. I know I have the ability to do anything I want to do.
16. When I apply myself to something I am confident I will succeed.
24. I believe I can do it.

Response scale:
FALSE not like me                         TRUE like me
1    2    3    4    5    6    7    8

Figure 4. The LEQ-H items by factor and response options.
Data

The LEQ data used in this research are from Neill's (2008) original sample. Neill collected LEQ data from a convenience sample made up of participants in outdoor education programs. Neill reported that the LEQ was explained to program participants as "a chance to self-evaluate, reflect and possibly set goals" and that "briefings to participants included explanation that the participants' data would be used to analyze and better understand the programs" (2008, p. 128). Only baseline data from LEQs that were completed on the first day of a program were analyzed for this paper (N = 3,634), although data were collected from respondents on up to four occasions.
Characteristics of respondents are summarized in Table 1. No information was available about the race/ethnicity or socio-economic status of participants. Voluntary status signifies that participants chose of their own free will to attend an adventure education program. Non-voluntary status in this sample is associated with adult, corporate-training participants and also with youth who attended schools, primarily private, that made attendance at Outward Bound Australia a compulsory component of school enrollment. Programs in which respondents participated were primarily affiliated with Outward Bound Australia (83%).

Table 1
Characteristics of Respondents

                             n       %
Gender
  Male                      2161    60
  Female                    1452    40
Age
  Under 19                  1512    42
  19-24                     1070    29
  25 & over                 1037    29
Voluntary status
  Not Voluntary              515    14
  Voluntary                 2828    78
  Partly Voluntary           285     8
Program type
  Outward Bound Australia   3015    83
  Non-Outward Bound          619    17
Data Analyses
Model fit. The ConQuest item response modeling software (Wu, Adams & Wilson, 1998) was used for all analyses. Cases were constrained1 and full standard errors were computed for all analyses (except where noted for DIF analyses) in order to maximize available item information. Unidimensional models were estimated using the Gauss-Hermite quadrature approximation method. Multidimensional models were estimated using the Monte Carlo method. Person separation reliability coefficients were calculated for unidimensional models using weighted likelihood estimation (WLE) (Warm, 1989), which corrects for certain types of bias (asymptotic variance) in the parameter estimates. Expected a posteriori estimates were calculated for multidimensional models. Detailed information on the statistical estimation procedures used by the ConQuest software package can be found in Adams and Wilson (1996) and Wu, Adams, Wilson, and Haldane (2007).
Two tests of fit were applied iteratively to the four different models (UDPC, UDRS, MDPC, MDRS) to determine the relative strengths and weaknesses of each. The four models include a partial credit unidimensional (UDPC) model; a rating scale unidimensional (UDRS) model; a partial credit multidimensional (MDPC) model; and a rating scale multidimensional (MDRS) model. First, statistical evidence of overall model fit was examined by conducting a likelihood ratio test comparing differences in the deviance and number of parameters among the models, using a chi-square statistic to determine significance. This type of analysis is feasible when the models being compared are hierarchically related, or nested (e.g., the rating scale model is a constrained version of the partial credit model).2
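Because ConQuest reports each model's deviance and number of estimated parameters, the likelihood ratio comparison can be reproduced in a few lines. The following Python sketch is illustrative only (the helper name lr_test is not part of any package); the example values are the composite-model deviances and parameter counts reported later in Table 3.

    from scipy.stats import chi2

    def lr_test(dev_constrained, dev_general, k_constrained, k_general):
        # Likelihood ratio test for two nested (hierarchically related) models.
        # The deviance difference is asymptotically chi-square distributed, with
        # df equal to the difference in the number of estimated parameters.
        diff = dev_constrained - dev_general   # the constrained model has the larger deviance
        df = k_general - k_constrained
        p = chi2.sf(diff, df)                  # survival function = 1 - CDF
        return diff, df, p

    # UDRS (deviance 258,595; 31 parameters) vs. UDPC (deviance 257,235; 169 parameters):
    diff, df, p = lr_test(258_595, 257_235, 31, 169)
    print(diff, df, p)                         # 1360, 138, p far below 0.001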
In the second fit test, the analysis moves from the broad fit of each model to a more narrow examination of residual-based fit statistics for items and steps. Item fit and step fit are evaluated using two fit indices: the weighted mean square fit statistic and the weighted t (Wright & Masters, 1981). The weighted mean square statistic, also called the infit statistic, allows us to examine how much the actual residuals (the difference between the observed score and the expected score for a specific person and item) vary in comparison to how they are expected to vary randomly if the data (i.e., responses to items) fit the given model (Wilson, 2005a). When residuals vary about as much as expected, the mean square values should be about 1.0 (Wilson, 2005a). Items with a mean square value greater than 1.0 have observed variance that is greater than expected. Items with a mean square value less than 1.0 have less observed variance than expected, which can occur, for instance, if there is local dependence among items due to a common stimulus (Wilson, 2005a).
1 When cases are constrained, the ConQuest software forces the means of persons along the latent variable to be set to zero and allows all item parameters to be estimated. Alternately, the mean of the parameter estimates for each term in the model statement can be forced to zero by constraining on items (Wu, Adams, Wilson, & Haldane, 2007). 2 Statistical significance can also be assessed using Akaike's information criterion (AIC; Akaike, 1974). However, the Bayesian information criterion (BIC; Raftery, 1995) is less useful because it only rank-orders the models and does not indicate whether the difference between them is statistically significant.
There is no absolute criterion that can be applied to a weighted mean square value to determine item fit. The weighted t is used to test the statistical significance of the mean square, and when the data fit the model, the t statistic will have a mean near zero and a standard deviation near one (Wright & Masters, 1981). However, the t statistic is sensitive to sample size and must be considered with caution because large sample sizes will result in many items with significant values. For purposes of this study, items with a weighted fit statistic less than 0.75 or greater than 1.33 and accompanied by a weighted t statistic with an absolute value greater than 2.00 were considered problematic (Wu & Adams, 2007). Models with few poorly-fitting items, given these criteria, are judged to have relatively better fit than models with many poorly-fitting items. The same fit statistic criteria are applied to the step parameter estimates to compare the rating scale and partial credit models.
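Once item fit statistics have been exported from ConQuest, these joint criteria are simple to apply programmatically. A minimal Python sketch follows; the function name and the (item, MNSQ, t) triples are illustrative, not LEQ output.

    def flag_misfit(items, lower=0.75, upper=1.33, t_crit=2.00):
        # An item is problematic only if BOTH conditions hold: its weighted MNSQ
        # falls outside [lower, upper] AND its weighted t exceeds t_crit in
        # absolute value.
        return [name for name, mnsq, t in items
                if (mnsq < lower or mnsq > upper) and abs(t) > t_crit]

    fits = [("itemA", 1.07, 2.9), ("itemB", 1.31, 11.2), ("itemC", 1.45, 15.7)]
    print(flag_misfit(fits))   # ['itemC'] -- itemB's MNSQ of 1.31 is still within range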
Analysis of Wright map person-item distribution. The composite and consecutive approaches are the focus of the validation effort after initially examining model fit, because they are the two approaches currently utilized for interpreting and using scores from the LEQ in program evaluations. The partial credit model is used to produce Wright Maps for the composite and the consecutive approaches to the LEQ. A Wright map graphically depicts empirical data, illustrating the probabilistic relationship between item difficulty and person proficiency estimates along the shared scale of the latent variable. The unit of measure on the interval scale produced by the item response model is the logit, the natural logarithm of the odds of success. The logit scale can easily be rescaled to another score range without loss of generality. The utility of the Wright map is to make the relative locations of these estimates both interpretable and meaningful (Wilson, 2005a).
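For readers working outside ConQuest, a rough Wright map can be drawn from any vectors of person and item estimates. The sketch below uses simulated values (all numbers illustrative, not LEQ estimates) and assumes numpy and matplotlib are available.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    persons = rng.normal(0.5, 1.0, 3634)   # simulated WLE person estimates (logits)
    items = rng.uniform(-1.5, 0.0, 24)     # simulated item difficulty estimates

    fig, (ax_p, ax_i) = plt.subplots(1, 2, sharey=True, figsize=(6, 6))
    ax_p.hist(persons, bins=40, orientation="horizontal", color="grey")
    ax_p.invert_xaxis()                    # persons on the left, as in ConQuest maps
    ax_p.set_ylabel("Logits")
    ax_p.set_title("Respondents")
    ax_i.scatter(np.zeros_like(items), items, marker="_", s=400)
    for k, d in enumerate(items, start=1): # label each item with its number
        ax_i.annotate(str(k), (0.02, d), fontsize=7)
    ax_i.set_title("Items")
    ax_i.set_xticks([])
    plt.tight_layout()
    plt.show()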
The dichotomous Rasch model expresses the probabilistic relationship between person ability and item difficulty mathematically as:

    P(X_i = 1 | θ, δ_i) = e^(θ − δ_i) / (1 + e^(θ − δ_i))          (6)

where P(X_i = 1 | θ, δ_i) is the probability of a person scoring a correct response on item i, given person ability (θ) and item difficulty (δ_i). This probability is equal to the constant e (the base of the natural logarithm, approximately 2.718) raised to the difference between person ability and item difficulty (θ − δ_i), divided by 1 plus e raised to that same difference (Wilson, 2005a). The relationship between logits and the probability calculated from equation 6 is illustrated in Table 2 below. The left-hand column in the table shows the difference between respondent locations (represented by Xs on the left side of a given Wright map) and item difficulties (represented by the LEQ item numbers 1 through 24 on the right side of a map), measured in logits.
Table 2
Relationship between Dichotomous Logit Differences and Probabilities for the Rasch Model (adapted from Wilson, 2005a)

θ − δ     Probability of Success
-4.0      0.02
-3.0      0.05
-2.0      0.12
-1.0      0.27
 0        0.50
 1.0      0.73
 2.0      0.88
 3.0      0.95
 4.0      0.98
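Table 2 can be reproduced directly from equation 6; a short Python sketch (numpy assumed):

    import numpy as np

    def rasch_p(theta_minus_delta):
        # Probability of success under the dichotomous Rasch model (equation 6).
        return np.exp(theta_minus_delta) / (1 + np.exp(theta_minus_delta))

    for d in [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]:
        print(f"{d:5.1f}  {rasch_p(d):.2f}")
    # Reproduces Table 2: -4.0 -> 0.02, 0 -> 0.50, 4.0 -> 0.98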
Wright Maps of calibrated LEQ data were used to assess evidence for the ordering and location of items, the range of item-to-person locations for the LEQ, the range of content representation, and response category functioning. If the researcher hypothesizes, prior to data collection, that the construct of interest is well represented by a relevant set of items that are qualitatively ordered along a high-low continuum, the Wright map can be used to confirm this hypothesized ordering. Items that are ordered in the Wright map as expected provide empirical evidence for the theoretical model that was used to develop instrument content (Wilson, 2005a). Items that are out of order may point to the need to re-assess the content relevance and representativeness that underlie the construct theory or cognitive model (Wilson, 2005a). Where there is no a priori hypothesis of ordering, the opportunities for theory confirmation via Wright Map interpretation are diminished. Even without an a priori hypothesis of ordering, extant item locations are useful for assessing the contribution of individual items to the overall scale. Items and item threshold locations (the point at which the k+1 step becomes more likely than the k step) may be too sparse, or may be over-represented, indicating a lack of precision or over-precision in measuring respondents along some parts of the construct. Such information is essential, for instance, if assessment is to be used to inform curriculum development and implementation, or to measure pre- to post-course change.
Item statistics. In addition to the fit statistics for items that are produced and examined during model fit analyses, mean location values are reviewed for the composite LEQ and subscales, as are traditional item statistics produced in the ConQuest output. Evidence of response category functioning is assessed through analysis of the Wright Map for item thresholds and item characteristic curves.
Differential item functioning. In the current analyses, the LEQ is examined for DIF by gender, age, and voluntary status, using the UDPC model for both the composite LEQ scale (24 items) and the eight LEQ subscales (three items per scale). Three age categories coded in Neill's (2008) original LEQ database were used for the DIF analyses: (a) under 19; (b) 19-24; (c) 25 and older. Data for male/female, as well as voluntary/non-voluntary status, were also originally coded by Neill (2008).
There are several methods for investigating DIF using item response modeling (Paek & Wilson, 2011; Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland & Wainer, 1993). The approach used in this paper adds an item-by-group interaction parameter to the original model. The two models are then compared using the chi-square likelihood ratio test, with degrees of freedom equal to the difference in the number of estimated parameters between the two models. If the DIF model for each subscale or the composite scale shows a better fit at the p = 0.05 significance level, individual items are further examined. For the gender and voluntary categories, individual items exhibit statistically significant DIF if (a) an item's parameter estimate is equal to or greater than twice its corresponding standard error and (b) it has a large effect size (Paek & Wilson, 2011). For the age category, the DIF parameter estimate is the difference in parameter estimates between the corresponding groups (e.g., the item parameter estimate of the under-19 age group minus the item parameter estimate of the 19-24 age group, and the item parameter estimate of the under-19 age group minus the item parameter estimate of the over-24 age group).
Standard errors for DIF analyses are usually computed using the full error variance-covariance matrix for the model that has been estimated, which produces the most accurate estimates of asymptotic error variances (Wu, Adams, Wilson & Haldane, 2007). This procedure was specified for calculating standard errors for the Gender and Voluntary DIF analyses. For the Age DIF analyses, the estimation procedure would not converge when using the full standard errors approach. Computation of full standard errors requires that every possible response category (i.e., 1-8 in the LEQ) be represented by at least one case for every item. Within the dataset (Neill, 2008) used for these analyses, 22 of 24 items had less than 1% of the total number of responses in response category 1, and 11 of 24 items had less than 1% of responses in response category 2. Response categories that were sparsely represented in the full dataset resulted in empty response categories when the dataset was divided into the three different age groupings. When there are only a few empty response categories, these can be represented through dummy cases and the full standard errors approach may continue to be used. However, the Age DIF analysis required the "quick" procedure for calculating standard errors, which ignores covariances between response model parameters and converges with multiple empty response categories, and thus tends to result in an under-estimation of standard errors (Wu, Adams, Wilson & Haldane, 2007).
The results of DIF analyses are impacted by sample size, with large samples providing enough power to detect small differences between groups, often resulting in statistically significant findings. Therefore, while DIF analysis is an important indicator that items are functioning statistically differently between groups, it does not indicate the effect size or magnitude of the functioning issue. Effect size is computed by comparing the absolute value of two times the DIF parameter of an item for pairwise comparisons, or the difference between parameters for multiple-group comparisons, to an agreed standard. The magnitude of the effect3 is considered: (a) negligible, if the absolute value of the effect size is less than 0.426; (b) intermediate, if the absolute value of the effect size is between 0.426 and 0.638 inclusive; or (c) large, if the absolute value of the effect size is greater than 0.638 (Paek, 2002; Paek & Wilson, 2011).
The magnitude of effect size is related to whether the DIF parameter estimate is likely to have a practical impact in the overall estimation of respondent abilities. For instance, if the effect size is less than 0.426, this means the difference in the ability estimate for the two groups of interest is less than four tenths of a logit. Items showing intermediate or large effect sizes are reviewed to consider possible sources of the DIF and whether or not these items should be retained in the instrument. For LEQ items displaying intermediate or large DIF, the consequences of retaining or removing items from the LEQ are considered in the contexts of the intended interpretation and use of LEQ scores.
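These classification rules reduce to a small helper function; the function name and the example DIF estimate below are illustrative only.

    def dif_magnitude(effect_size):
        # Rasch analogues of the ETS Mantel-Haenszel DIF classification rules
        # (Paek, 2002; Paek & Wilson, 2011).
        e = abs(effect_size)
        if e < 0.426:
            return "negligible"
        if e <= 0.638:
            return "intermediate"
        return "large"

    # For a pairwise comparison the effect size is |2 x DIF parameter|;
    # e.g., a hypothetical gender DIF estimate of 0.25 logits:
    print(dif_magnitude(2 * 0.25))   # -> 'intermediate'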
3 DIF classification rules for Rasch models are constructed to correspond with the Educational Testing Service (ETS) DIF classification rules based on the Mantel-Haenszel approach to investigating DIF (cf. Longford, Holland, & Thayer, 1993, p. 175, cited in Paek & Wilson, 2011).
Chapter 4: Results
Instrument validation should be an iterative process, not a strictly orderly one, that focuses on the relationships between items, the scoring of responses, the substantive theory underlying items and response options, and the measurement model (Brown & Wilson, 2011). It follows that the reporting of results from this validation effort focuses on the relationships among these many parts of the LEQ and is, at times, non-linear. The relationships are framed in this chapter by two sources of validity evidence from the Standards (AERA et al., 1999): evidence based on test content and evidence based on internal structure. Findings from one source of evidence often inform findings for another, so some overlap in reporting of results is expected.
Results are organized into three sections. The first section reviews specifications for the intended interpretation and use of scores that are articulated by the developers of the LEQ. This section highlights some assumptions of the instrument developers' interpretive argument. The second section examines validity evidence based on instrument content for the composite LEQ and the eight subscales of the LEQ. This second section on test content begins with a review of the blueprint for the instrument and proceeds with examining findings from the Rasch IRM analyses of the LEQ data. The third section examines validity evidence based on the internal structure of the composite LEQ and each of the subscales. This third section has several main subparts: (a) a review of evidence for model fit, dimensionality, and response category functioning based on model fit analyses; (b) a review of relationships between items and the instrument as a whole, including item-person distribution, conditional standard errors of measurement, and average measure values; and (c) a review of DIF for groups based on gender, age, and voluntary status.
Section 1: An Interpretive Validity Argument for the Life Effectiveness Questionnaire

Validation of an instrument should begin with specifications by test developers in which "an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use" is set out (AERA et al., 1999, p. 9; Standards 1.1, 1.2). Such specifications constitute the interpretive argument of the validation process and lay the foundation for the types of evidence that should be accumulated in the appraisal, or evaluation, stage of validation (Kane, 1992). The interpretive argument for the LEQ is reviewed below, beginning with the test developers' definition of and rationale for the life effectiveness construct, followed by the intended interpretation and use of LEQ scores.
Life Effectiveness as a latent variable. Life effectiveness, the latent variable theorized to be measured by the LEQ, is defined as a person's self-perceived capacity and skill for adapting, surviving and thriving in life (Neill et al., 2003; Neill, 2008). The LEQ seeks to measure life effectiveness by asking respondents to assess their behavior in eight domains: Time Management (TM), Social Competence (SO), Achievement Motivation (AM), Intellectual Flexibility (IF), Task Leadership (TL), Emotional Control (EC), Active Initiative (AI), and Self-Confidence (SC). The amount of life effectiveness possessed by a person represents the extent to which that individual "actively and ably handles...goals and challenges" (Neill, 2008, p. 58) both within and across these various domains.
Life Effectiveness as a construct was developed in the context of outdoor adventure education programs, in which the authors have extensive experience in programming or outcomes research, or both (cf. Marsh, Richards & Barnes, 1984; Marsh & Richards, 1990; Hattie, Marsh, Neill & Richards, 1997). Nonetheless, the factors and items of the LEQ were simultaneously theorized to be broadly applicable to life effectiveness as it is addressed in a wide variety of experiential and social-psychological programs (Neill, 2008). The authors derived the proposed factors of the LEQ from "theory and research in the fields of psychology, education, and outdoor education" (Neill, 2008, p. 97).
Neill (2008) posits that the life effectiveness construct draws upon, but is differentiated from, other constructs such as self-concept (Combs, 1999, as cited in Neill), self-efficacy (Bandura, 1997, as cited in Neill), and practical intelligence (Sternberg, Wagner, Williams & Horvath, 1995, as cited in Neill). Life effectiveness differs from these constructs in that it focuses on a person's perception of their actual success in performing practical life skills, not on whether they believe they have the capacity to perform various life skills (Neill, 2008). Life effectiveness also focuses on generic abilities that give "cross-situational advantage" (Neill, 2008) across various contexts, rather than on task-specific knowledge. In contrast to personality traits, which are widely believed to be relatively stable and predictably expressed across different contexts, life effectiveness is conceptualized to encompass behavioral, cognitive and emotional competencies that are both learnable and changeable. Thus, both the global construct of life effectiveness, as well as its theorized eight sub-constructs, may be affected by an individual's participation in a variety of psychosocial intervention programs that target these competencies.
Intended interpretation of LEQ scores. The LEQ is structured so that, in theory, it can be administered and analyzed as either a composite, 24-item unidimensional scale or as eight unidimensional subscales, with three items for each subscale. This structure was designed so that program evaluators could use the LEQ to gather as much information as possible about multiple constructs while simultaneously maximizing reliability and minimizing respondent and program burden (Neill, 2008). LEQ scores between Time 1 and Time 2 (pre/post intervention) are described in terms of the magnitude (e.g., small, moderate, large) and direction of change (e.g., negative change, no change, positive change) (Neill, 2008, p. 232), but other substantive qualities relating to change in life effectiveness are not identified. Life effectiveness scores can be interpreted to mean more or less of the behaviors, skills and competencies that a person possesses and performs that are "necessary for, and beneficial to, effective and successful human living and working" (Neill, 2008, p. 123). "Higher scores" on the LEQ are said to indicate "more positive self-assessments" (Neill, 2008, p. 156). Presumably, this means that high scores on the composite scale correspond to respondents' high self-assessment of their overall life effectiveness (i.e., their behavioral, emotional and cognitive competencies).
Intended use of LEQ scores. The LEQ was intentionally designed so that interpretations of LEQ scores could be used primarily to evaluate whether "experience-based intervention programs" had some effect on students' life effectiveness (Neill, 2008, p. 124). A secondary "design principle" of the LEQ was that the test would be "educational and interesting" for program participants (Neill, 2008, p. 124), building upon the idea that reflection on test scores is often used programmatically to promote "personal awareness, growth and action" (AERA et al., 1999, p. 129). These intended uses were constrained by expectations that the instrument would simultaneously be: (a) relevant to the outcome aims of experience-based intervention programs; (b) easily and quickly used in a field setting; and (c) able to "elicit meaningful response distributions" that are sensitive both to change and to no change over time.
Section 2: Validity Evidence Based on Instrument Content
Gathering validity evidence for the interpretation and use of test scores begins with "an analysis of the relationship between a test's content and the construct it is intended to measure" (AERA et al., 1999). The content of the LEQ (Neill, Marsh, & Richards, 2003) is intended to measure a newly defined construct, Life Effectiveness (Neill, 2008). Eleven original LEQ factors, or subdomains, are listed in the blueprint (Figure 5). The subdomains of Organizational Self-Discipline and Productive Teamwork were removed from the LEQ-G to create the LEQ-H, as described in Neill (2008). The number of proposed and final factors in the LEQ is evidence of the wide theoretical net that the instrument's authors cast in an effort to capture construct representation for the global domain of Life Effectiveness.
The blueprint simultaneously provides evidence of the more limited theoretical scope and corresponding item content within each of the subdomains. Each subdomain is implicitly hypothesized as: (1) unidimensional and (2) consisting of one qualitative category or level of meaning, as evidenced by the corresponding description in the blueprint. A review of items suggests that item content for any given subdomain is: (1) aligned with one qualitative category or level of meaning; (2) substantively interchangeable; and (3) not intentionally ranked or ordered substantively. For example, social competence in the LEQ is defined in the blueprint as "Effectiveness, confidence, and competence in social interactions." The theoretical approach to defining social competence in the LEQ rests on a generalized, summative conception of the construct. The generalized approach is similar to many others in the social competence literature, yet it also competes with alternate, multidimensional and hierarchical conceptions of social competence that include qualitatively different levels of skills, motivations and outcomes in varied relational contexts (Rose-Krasnor, 1997). The generalized approach to defining subdomains, rather than a more detailed approach, is used for all Life Effectiveness subdomains. Generalized definitions for subdomains were chosen for the LEQ at least in part to prioritize instrument usability in terms of its length and complexity (Neill, 2008).

Achievement Motivation (AM; LEQ-G, LEQ-H). Motivated to achieve excellence and put the required effort into action to attain it. Sample item: "I try to do the best that I possibly can."
Active Initiative (AI; LEQ-G, LEQ-H). Initiates action in new situations. Likes to be busy, energetic, and actively involved in things. Sample item: "I like to be an active, 'get into it' person."
Emotional Control (EC; LEQ-G, LEQ-H). Capacity to stay calm in new, changing or stressful situations. Sample item: "I stay calm when things go wrong."
Hardiness Resourcefulness (HR; LEQ-G). Succeeds in performing difficult tasks with limited resources. Sample item: "I can perform with limited resources."
Intellectual Flexibility (IF; LEQ-G, LEQ-H). Adapts thinking and accommodates different perspectives when presented with better ideas. Sample item: "I am open to new ideas."
Organisational Self-Discipline (OD; LEQ-G). Controls behaviour in a logical, sensible, and organised manner. Sample item: "I make logical decisions and am self-disciplined in acting on these decisions."
Productive Teamwork (PT; LEQ-G). Uses interpersonal skills to work productively in team situations. Sample item: "I am productive and cooperative when working with a team."
Self Confidence (SC; LEQ-H). Confidence in personal abilities and success of one's actions. Sample item: "I believe I can do it."
Social Competence (SO; LEQ-H). Effectiveness, confidence, and competence in social interactions. Sample item: "I am successful in social situations."
Task Leadership (TL; LEQ-H). Leads and motivates other people effectively when a task needs to be done. Sample item: "I am a good leader when a task needs to be done."
Time Management (TM; LEQ-H). Makes optimal use of time. Efficient time management, with minimal time wastage. Sample item: "I manage the way I use my time well."

Figure 5. Blueprint of LEQ factors, descriptions, and sample items (adapted with permission from Neill, 2008, p. 126).
The theoretical specifications of the subdomains have important implications for the design of items. Qualitatively and specifically, item wording represents a single, high level of self-assessment for each generalized subdomain (see Figure 5). Social competence items include: "I am successful in social situations," "I am competent in social situations," and "I communicate well with other people" (emphasis added here). All item content appears to be intentionally pegged to optimal levels of competence and success in social interactions. The opportunity for self-assessing competence as lower than optimal (e.g., moderate or low) is restricted to low-numbered response options, rather than specified through item content. Scores for all items are summed and averaged for the composite LEQ scale and the eight subscales.
The hypothetical outline provided in the blueprint can be empirically tested through analysis of the Wright map output for LEQ data.
Composite LEQ. To understand the relationships represented in the Wright Map of data for the composite LEQ (Figure 6), we note first that the vertical dashed line in the middle of the Wright map represents the Life Effectiveness construct along a continuous logit scale. Logits are the unit of measurement in which constructs are most commonly expressed in item response models, and these units are labeled on the far left side of the figure (i.e., 2 through -3). To the left of the vertical logit scale, respondent locations (i.e., person ability estimates) are signified by Xs. In this analysis using the UDPC model with cases constrained, each X represents approximately 23 respondents, as noted at the bottom of the Wright Map. To the right of the vertical logit scale, item locations (i.e., item difficulty parameter estimates) are indicated with the LEQ item numbers 1 through 24.
The probabilistic relationship that exists between persons and items is expressed on the Wright map via their locations relative to one another on either side of the logit scale. That is, when an item and a person are at the same level or location on the logit scale, there is a 0.50 probability of a person at that level endorsing (or correctly responding to) the corresponding item. For polytomous items, the item-level interpretation needs to be magnified using the thresholds between response options, which is done later in this section through analysis of item threshold maps.
In the Wright map of items, a person located higher than a particular item on the scale will have greater than a 0.50 probability of endorsing the item; a person located lower on the scale will have less than a 0.50 probability of endorsing the item. Items with positive logits measure relatively higher levels of life effectiveness, and items with negative logits measure relatively lower levels of life effectiveness. However, we must be cautious about saying that certain logit levels indicate a certain level of life effectiveness: because the items are relatively easy and were written independently of a defined life effectiveness construct, the content of the items is not intrinsically linked to a meaningful amount or type of life effectiveness. In Figure 6, the logit of 0 corresponds to a respondent having a moderate amount of Life Effectiveness according to this instrument, because we can see that the mean location of the respondents is well above the location of the items. A respondent at logit 2 in this map has a relatively high level of life effectiveness when compared with a respondent at -2.
[Figure 6 reproduces the ConQuest map of WLE estimates and response model parameter estimates for the composite (UDPC) LEQ: respondent locations (each 'X' representing 23.0 cases) are plotted to the left of the logit scale, which runs from approximately +2 to -3, and item difficulty locations for items 1 through 24 are plotted to the right. Every item except Item 9 falls between approximately -0.5 and -1.5 logits.]
Figure 6. Wright Map of the composite (UDPC) LEQ.
One of the most striking features of this particular map of LEQ data is that all of the items except Item 9 are located within the lower half of the logit scale (-0.5 to -1.5). The location of the item difficulty estimates in relation to person abilities here means that LEQ items represent low to moderately low amounts of Life Effectiveness and were easy for the majority of respondents to positively endorse, even prior to any intervention (all data are baseline). Recall, however, that item content for the subdomains reflected qualitatively high levels for each subdomain, or the highest level of competency for the subdomains that programs would like to observe as a result of an intervention. One would expect, then, for items that adequately represented high levels of competency to be much more difficult (and therefore located higher on the logit scale).

The Wright map cannot tell us why the items were easy for respondents, but it does provide evidence of a mismatch between the intended meaning of item content and the empirical location of items on the latent variable. It is also worth noting that the range of items is within a small, 1.25-logit spread (compared with a 5-logit spread for persons). Despite the mismatch between person and item estimates, this narrow range does capture the intent for item content to represent a single substantive level of life effectiveness.
The reasoning behind the decision to develop item content that was relatively homogeneous and not substantively ordered is not specified (Neill, 2008). Unordered items with interchangeable content may have been a de facto artifact of using a generalized (i.e., global-like) description for each of the subdomains, rather than a multidimensional or hierarchical description. It is just as likely that it arises out of the classical test theory approach to scale development, in which efforts to obtain high correlations through parallel items result in "superficial similarities" between items (DeVellis, 2006, p. S57).
However, if a latent variable is theorized to exist along a linear continuum (e.g., high to low, more to less) and derived scores are expressed in a corresponding continuum, then theoretically one would expect that items should be constituted along a continuum as well. This continuum of the construct could be more meaningfully represented hierarchically, at qualitatively different levels, with the substantive content of the variable expressed in the wording of the items and the response options (Wilson, 2005a). In contrast to the type of item used in the LEQ, when a construct mapping approach (Wilson, 2005a) is used to create test content, instrument development involves intentionally connecting the meaning of item content and the location of items on a shared scale (cf., Brown & Wilson, 2011; Allen, 2010). Construct mapping allows the test developer to theorize about content relevance and representativeness for a construct, and to then test this theory through item response modeling.
Without an a priori hypothesis about substantive ordering, as is the case with the LEQ, we are precluded from using the Wright Map to compare the hypothesized sequencing of items with empirical sequencing. This, in turn, limits one of the primary opportunities for collecting validity evidence based on instrument content using the Rasch model. Nonetheless, when empirical evidence shows that there is no order to test content—whether or not the lack of order was intentional—this raises questions as to the adequacy with which items represent the content domain.
Since the LEQ uses polytomously scored items, a Wright Map of the thresholds of items (Figure 7) should also be examined for empirical evidence of domain content coverage. Item thresholds are the points of transition between response options at which there is an equal probability of a respondent responding in a given category or below, versus responding in a higher category. For instance, in Figure 7, for Item 16 ("When I apply myself to something I am confident I will succeed") a respondent located at logit -1.3 (threshold 16.4) had a 0.50 probability of selecting response category 4 or below versus response category 5 or above. Thus, for the life effectiveness scale, threshold 4 represents the point at which a respondent moves from the set of response options associated with the anchor label "FALSE not like me" to the set of response options associated with "TRUE like me."
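Under the partial credit model, these cumulative thresholds can be computed numerically from an item's step parameters. The Python sketch below (function names and step values are illustrative, not LEQ estimates; scipy assumed) finds the location at which the probability of responding in category k or below equals 0.50.

    import numpy as np
    from scipy.optimize import brentq

    def pcm_category_probs(theta, deltas):
        # Category probabilities for one partial credit item with steps
        # delta_1..delta_m; categories are indexed 0..m (instrument options 1..m+1).
        cumsums = np.concatenate([[0.0], np.cumsum(theta - np.asarray(deltas))])
        num = np.exp(cumsums - cumsums.max())   # stabilized exponentials
        return num / num.sum()

    def cumulative_threshold(k, deltas):
        # Location where P(response <= k) = P(response > k) = 0.5.
        f = lambda th: pcm_category_probs(th, deltas)[: k + 1].sum() - 0.5
        return brentq(f, -10, 10)

    # Hypothetical step parameters for an eight-option (seven-step) LEQ-like item;
    # k = 3 corresponds to "response option 4 or below" as described above.
    deltas = [-2.5, -1.8, -1.2, -0.6, 0.1, 0.9, 1.8]
    print(cumulative_threshold(3, deltas))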
[Figure 7 reproduces the ConQuest map of WLE estimates and item thresholds for the composite LEQ: respondent locations appear to the left of the logit scale and the seven thresholds for each of the 24 items appear to the right, spanning roughly +2.5 to -3 logits; the highest steps (.6, .7) cluster toward the top of the scale and the lowest steps (.1, .2) toward the bottom. Each 'X' represents 23.0 cases; threshold labels show the item and step levels, respectively (e.g., 16.4 is the fourth threshold of Item 16).]

Figure 7. Wright map of composite LEQ item thresholds.
The distribution of item thresholds does a better job than the item estimates of providing evidence of domain content coverage over multiple levels of the Life Effectiveness construct continuum. It is worth noting, though, that most item thresholds (i.e., thresholds 1-5) assess low or moderate life effectiveness, and only thresholds 6 and 7 assess moderate and high life effectiveness. This distribution of item thresholds indicates that the LEQ content representation of the life effectiveness construct is skewed toward measuring low and moderate life effectiveness.
The Wright maps show that the composite LEQ relies on the item thresholds, rather than the items, to represent the content continuum of the Life Effectiveness construct. Unfortunately, the thresholds are represented only by numbers and by short verbal labels at the anchor points, which have very limited interpretability. The anchor labels of "FALSE not like me" over the 1-4 response category options and "TRUE like me" over the 5-8 options can theoretically help respondents to disambiguate and differentiate the numbers. However, relying on numbers to represent the Life Effectiveness construct is problematic. The numbered response options in the LEQ are arbitrary metrics (Blanton & Jaccard, 2006), so there is no inherent meaning to how a one-unit change in an observed score on the LEQ reflects the magnitude of change on the Life Effectiveness construct. That is, what does it mean, in terms of life effectiveness, to go from a 4 to a 5, or a 5 to a 6? Just how much more effective in life is a respondent who scores a 6 versus a respondent who scores a 7? What skill or capacity did a person gain from a given intervention if they move from a 5 pre-course to a 6 post-course? Answers to these sorts of questions are difficult to ascertain without meaningful, specific response category labels that delineate landmarks in the development of life effectiveness.
Furthermore, unlabeled or vaguely labeled response category options are problematic in that they require individual respondents to interpret each number and imbue it with some unspecified qualitative meaning (Wilson, Allen, & Li, 2006). In essence, the numbered response category format fuses (a) the internal information that respondents process as they think about their answer to each question and (b) the coding of that same information (Carifio & Perla, 2007; Sudman, Bradburn, & Schwarz, 1996; Tourangeau, Rips, & Rasinski, 2000). Thus, both the process of giving meaning to the numbers and the substantive meaning given to the numbers are internal to each respondent and unknown to the researcher. Interpretation of responses as evidence for validity based on instrument content is therefore primarily limited to an examination of the underlying continuum of "more" and "less" that is conveyed through the increasing value of the integers and their representation in item thresholds.
Subscales. Wright maps for each of the subscales show a poor spread of item estimates compared with the spread of person estimates. Coverage of content is better, although still generally incomplete, when considering the item thresholds. Wright maps of item parameter estimates and item thresholds for all subdomains can be found in Appendix B. With only three items per subscale, there is evidence for all subscales of consequential gaps in content coverage.
For example, the Wright map (Figure 8) for Active Initiative (AI) shows that all three AI items are at the same difficulty level and are low on the logit scale. Person AI competency ranges from approximately 3.5 logits to -4.25 logits, nearly an eight-logit spread. However, all the items are located at approximately -2.5, meaning that these items are very easy for respondents to agree with and all measure the same low level of the AI construct. The Wright Map of item thresholds for AI (Figure 9) shows that thresholds are skewed toward the bottom of the scale, where they represent low life effectiveness content. Conversely, respondents are distributed more through the middle and top of the scale, where content representing moderately high life effectiveness is sparsely represented and content representing high life effectiveness is not represented at all.
[Figure 8 reproduces the ConQuest map of WLE estimates and response model parameter estimates for the Active Initiative subscale: respondent locations span roughly +4 to -4.5 logits, while all three AI items (1, 2, 3) cluster at approximately -2.5 logits. Each 'X' represents 48.7 cases.]

Figure 8. Wright Map for Active Initiative.
A review of the AI subscale items (see Figure 4 for item wording) suggests why content representation is limited for both items and item thresholds: all three items ask respondents to self-report whether they like to be active, rather than to report on their actual active behavior and initiative-taking. AI items may represent the low end of the AI construct because it may be easier to respond positively about a preference for being active and taking initiative than to respond to items that represent specific types of current active behavior, or the extent to which such active behavior is performed. Other personality and behavior instruments have addressed issues related to ideal performance as compared with current or real performance by explicitly asking respondents to indicate how they would like to perform as compared with how they actually currently perform (Allen, 2007; Waugh, 2001).
[Figure 9 reproduces the ConQuest map of WLE estimates and item thresholds for the Active Initiative subscale: respondents are spread through the middle and top of the logit scale (roughly +4 to -4), while the item thresholds are skewed toward the bottom, with the first and second steps of all three items falling between about -4 and -6 logits and only the highest steps (.7) reaching approximately +1.5. Each 'X' represents 48.7 cases; threshold labels show the item and step levels, respectively.]

Figure 9. Wright Map for Active Initiative item thresholds.
Section 3: Validity Evidence Based on Internal Structure

Assessment of the internal structure of the LEQ includes understanding how the many parts of the instrument—including items, response categories, and dimensions—contribute to the overall functioning of the instrument, and how this functionality supports test score interpretation and use. The relationships among components that are central to the internal structure of the LEQ are examined in this section, beginning with evaluation of model fit, then a review of the distribution of items and persons along the shared scale of the construct, followed by an examination of item statistics, and concluding with an analysis of differential item functioning.
Assessing fit for the Rating Scale versus the Partial Credit model. The rating scale model and partial credit model allow two types of overall item response patterns in the transition, or step, from one response option to another. In the LEQ, where there are eight ordered response options, or levels, there are seven steps between the eight different levels (i.e., the transition from the first response option to the second response option represents step one, and so on). The second step, from response option two to response option three, can only be taken if the first step from response option one to response option two has already been taken.
The use of sum scores with Likert-type scales such as the LEQ rests, in part, on the assumption that respondents interpret the intervals between response options equivalently, both within and across items. This assumption can be tested both statistically and substantively by fitting the data to a partial credit model. For instance, if it seems theoretically plausible that respondents find it more difficult to move from response category 7 to 8 than from response category 5 to 6, then the partial credit model may be more appropriate. Comparing the two models allows the researcher to determine the consistency with which respondents interpret and use response categories across all items.
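The nesting of the two models can be made concrete: the rating scale model is the partial credit model with one shared set of step parameters, so every item's step j is simply that item's difficulty plus a common tau_j. A brief numpy sketch (the item and step values are illustrative, loosely echoing the UDRS estimates shown later in Figure 10):

    import numpy as np

    item_difficulties = np.array([-0.52, -0.58, -1.21])               # one delta per item
    taus = np.array([-1.15, -1.10, -0.72, -0.39, 0.15, 1.13, 2.08])   # shared step offsets

    # Rating scale model: the (items x steps) matrix is fully determined
    # by the item difficulties plus the shared taus ...
    rsm_steps = item_difficulties[:, None] + taus[None, :]
    print(rsm_steps.shape)   # (3, 7)

    # ... whereas the partial credit model estimates every entry of that
    # matrix freely, one step set per item.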
The first part of this section examines whether the rating scale or partial credit model is the better fitting model when using the composite, multidimensional or consecutive approach to modeling LEQ data. Results are presented by comparing (a) the composite (unidimensional) rating scale (UDRS) model with the composite (unidimensional) partial credit (UDPC) model; (b) the multidimensional rating scale (MDRS) model with the multidimensional partial credit (MDPC) model; and (c) the consecutive (unidimensional subscale) rating scale (UDRS) model with the consecutive (unidimensional subscale) partial credit (UDPC) model for each of the eight subscales. For the composite and multidimensional models, as well as for seven of the eight subscales, the partial credit model fit the data best.
Results for the second part of this section use fit statistics and reliability estimates to compare (a) the best-fitting composite model with the best-fitting multidimensional model and (b) the best-fitting consecutive model with the composite and multidimensional models. The comparative process of assessing fit explores both the statistical strengths and weaknesses of the different models as applied to LEQ data.
Composite approach to rating scale and partial credit unidimensional models. Comparison of fit between models begins by looking at the difference in the number of parameters of each model. ConQuest output indicates that the UDRS model used 31 parameters and the UDPC model used 169. For the UDRS model, the parameters are one item difficulty parameter for each item, one parameter for each step (with eight response options there are seven steps, but ConQuest constrains the last step of each item so that there are only six identifiable step parameters), and one population variance (i.e., (24 x 1) + 6 + 1 = 31). For the UDPC model, the parameters are one average item difficulty and six estimated steps from one response category to the next for each item, plus one population variance (i.e., (24 x 1) + (24 x 6) + 1 = 169).
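The same bookkeeping yields the parameter counts for all four models; a quick arithmetic check (counts taken from the ConQuest output reported in this section):

    n_items, n_steps, n_dims = 24, 6, 8   # six identifiable steps per item in ConQuest
    n_covs = n_dims * (n_dims - 1) // 2   # 28 covariances among the eight dimensions

    udrs = n_items + n_steps + 1                           # 24 + 6 + 1        = 31
    udpc = n_items + n_items * n_steps + 1                 # 24 + 144 + 1      = 169
    mdrs = n_items + n_steps + n_dims + n_covs             # 24 + 6 + 8 + 28   = 66
    mdpc = n_items + n_items * n_steps + n_dims + n_covs   # 24 + 144 + 8 + 28 = 204
    print(udrs, udpc, mdrs, mdpc)                          # 31 169 66 204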
The second step in evaluating fit is comparing the deviance to the number of parameters estimated, to see if the better fit of the model is worth the sacrifice in degrees of freedom (i.e., the increase in estimated parameters). The difference in deviance between hierarchically nested models approximates a chi-square (χ2) distribution, and the difference in the number of parameters estimated for the two models provides the number of degrees of freedom (df); the lower the deviance, the better the fit. The deviance is a statistic reported by ConQuest. For this comparison, the difference in deviance between the UDRS and UDPC models is 1,360, with df = 138. When compared with the critical value of the chi-square distribution at p < 0.001 (which is approximately 195), the difference in deviance is highly statistically significant, suggesting that the partial credit model is a better fit than the rating scale model for the composite approach (Table 3).
Table 3
Model Fit Criteria for UDRS vs. UDPC

Model   Final Deviance   Final Estimated Parameters   Chi-square Test v. Previous Model    Reliabilities (MLE/WLE/EAP)
UDRS    258,595          31                           N/A                                  0.93/0.93/0.92
UDPC    257,235          169                          Change: 1,360; df: 138; p < 0.001    0.93/0.93/0.92
The difference in fit is further illustrated by examining the weighted mean square fit statistics (MNSQ) for the item and step parameters of the UDRS and UDPC models. All item parameters fit within acceptable ranges for both unidimensional models. However, it is worth noting that for both models, Item 4 ("I change my thinking or opinions easily if there is a better idea") has a relatively high MNSQ and significant t values (MNSQ = 1.31, t = 11.2 for UDRS; MNSQ = 1.32, t = 10.9 for UDPC). All step parameters are in the acceptable range for the UDPC model (Table 4).
Table 4
Rating Scale vs. Partial Credit Unidimensional Models: Item Fit and Step Misfit (in percentages)

Model   % Smaller Variation   % Items with Reasonable Fit   % Higher Variation   % Response Category Parameters with
        (<0.75)               (0.75 < MNSQ < 1.33)          (>1.33)              Misfitting Steps and Significant t Values
UDRS    0                     100                           0                    100
UDPC    0                     100                           0                    0

However, for the UDRS model, all step parameters fit poorly, with MNSQ values > 1.33 and t values that are all highly significant, indicating that there is greater variance than expected for all steps of all items (Figure 10). This misfit is an artifact of trying to fit all items to the equal-step pattern of the rating scale model.
==============================================================================
UDRS TABLES OF RESPONSE MODEL PARAMETER ESTIMATES
==============================================================================
TERM 1: item
------------------------------------------------------------------------------
                               UNWEIGHTED FIT            WEIGHTED FIT
item            ESTIMATE ERROR MNSQ  CI            T     MNSQ  CI            T
------------------------------------------------------------------------------
 1 TimeMgtPLAN   -0.518  0.014 1.10 (0.95, 1.05)   4.0   1.07 (0.95, 1.05)  2.9
 2 SocCompSUCC   -0.576  0.014 0.99 (0.95, 1.05)  -0.4   0.96 (0.95, 1.05) -1.5
 3 AchMotDETAI   -1.208  0.016 1.14 (0.95, 1.05)   5.6   1.13 (0.95, 1.05)  5.1
 4 IntFlexCHAN   -1.111  0.016 1.34 (0.95, 1.05)  13.1   1.31 (0.95, 1.05) 11.2
 5 TskLdWORK     -0.530  0.014 1.14 (0.95, 1.05)   5.9   1.11 (0.95, 1.05)  4.2
 6 EmotContST    -0.807  0.015 1.25 (0.95, 1.05)   9.7   1.25 (0.95, 1.05)  9.2
 7 ActIniBUSY    -1.279  0.017 1.05 (0.95, 1.05)   2.0   1.11 (0.95, 1.05)  4.0
 8 SelfConABI    -1.265  0.017 1.20 (0.95, 1.05)   8.2   1.26 (0.95, 1.05)  9.3
 9 TimeMgtWAST   -0.246  0.013 1.19 (0.95, 1.05)   7.5   1.16 (0.95, 1.05)  6.6
10 SocCompCOMP   -0.704  0.015 0.94 (0.95, 1.05)  -2.7   0.93 (0.95, 1.05) -2.9
11 AchMotRESUL   -1.472  0.017 0.90 (0.95, 1.05)  -4.3   0.92 (0.95, 1.05) -3.0
12 IntFlexNEW    -1.437  0.017 0.95 (0.95, 1.05)  -2.1   0.96 (0.95, 1.05) -1.5
13 TskLdLEADER   -0.817  0.015 0.90 (0.95, 1.05)  -4.4   0.91 (0.95, 1.05) -3.9
14 EmotContNEW   -0.803  0.015 0.87 (0.95, 1.05)  -5.7   0.89 (0.95, 1.05) -4.6
15 ActIniENRG    -1.333  0.017 1.18 (0.95, 1.05)   7.1   1.23 (0.95, 1.05)  8.4
16 SelfConAPP    -1.078  0.016 0.86 (0.95, 1.05)  -6.1   0.88 (0.95, 1.05) -4.9
17 TimeMgtMANA   -0.451  0.014 0.91 (0.95, 1.05)  -3.8   0.90 (0.95, 1.05) -4.2
18 SlfCompCOMM   -0.873  0.015 0.92 (0.95, 1.05)  -3.4   0.93 (0.95, 1.05) -2.6
19 AchMotPOSSI   -1.489  0.018 0.88 (0.95, 1.05)  -5.2   0.93 (0.95, 1.05) -2.9
20 IntFlexADAP   -1.133  0.016 0.83 (0.95, 1.05)  -7.6   0.84 (0.95, 1.05) -6.5
21 TskLdMOTIVA   -0.660  0.014 0.81 (0.95, 1.05)  -8.7   0.81 (0.95, 1.05) -8.3
22 EmotContWRO   -0.806  0.015 1.03 (0.95, 1.05)   1.4   1.04 (0.95, 1.05)  1.7
23 ActIniGETT    -1.183  0.016 1.01 (0.95, 1.05)   0.5   1.05 (0.95, 1.05)  2.0
24 SelfConBELI   -1.363  0.017 1.06 (0.95, 1.05)   2.6   1.10 (0.95, 1.05)  3.9
------------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained.
Separation Reliability = 0.998
Chi-square test of parameter equality = 96050.78, df = 24, Sig Level = 0.000
==============================================================================
TERM 2: step
------------------------------------------------------------------------------
                               UNWEIGHTED FIT            WEIGHTED FIT
step            ESTIMATE ERROR MNSQ   CI            T    MNSQ  CI            T
------------------------------------------------------------------------------
 0                             266.57 (0.95, 1.05) 695.1  3.36 (0.86, 1.14) 20.4
 1               -1.150  0.010   9.14 (0.95, 1.05) 139.5  2.21 (0.89, 1.11) 16.8
 2               -1.096  0.010   2.33 (0.95, 1.05)  41.7  1.92 (0.92, 1.08) 18.5
 3               -0.720  0.009   1.71 (0.95, 1.05)  24.9  2.01 (0.94, 1.06) 25.5
 4               -0.391  0.008   1.82 (0.95, 1.05)  28.1  2.02 (0.95, 1.05) 30.8
 5                0.153  0.007   2.08 (0.95, 1.05)  35.2  2.15 (0.95, 1.05) 37.0
 6                1.126  0.008   2.30 (0.95, 1.05)  41.0  2.40 (0.95, 1.05) 42.0
 7                2.079*         4.09 (0.95, 1.05)  76.6  3.05 (0.94, 1.06) 47.7
------------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained.
Figure 10. Unidimensional rating scale (UDRS) weighted mean square fit statistics (MNSQ).
Multidimensional approach. For the MDRS model, a total of 66 parameters were estimated: 24 item difficulty parameters, six step parameters, eight population variances (population means are constrained to zero), and 28 covariance estimates. Altogether, 204 parameters were estimated for the MDPC model: 24 item difficulties, 144 step parameters, eight population variances, and 28 covariance estimates. For the likelihood ratio test, the difference in deviance between the MDRS and MDPC models is 768, with df = 138, which is highly statistically significant (Table 5).

Table 5
Model Fit Criteria for MDRS vs. MDPC

Model   Final Deviance   Final Estimated Parameters   Chi-square Test v. Previous Model   Reliabilities (MLE/WLE/EAP)
MDRS    243,971          66                           N/A                                 D1: 0.80/0.79/0.82
                                                                                          D2: 0.77/0.77/0.83
                                                                                          D3: 0.68/0.69/0.79
                                                                                          D4: 0.69/0.69/0.76
                                                                                          D5: 0.78/0.78/0.86
                                                                                          D6: 0.78/0.78/0.83
                                                                                          D7: 0.73/0.75/0.85
                                                                                          D8: 0.74/0.75/0.88
MDPC    243,203          204                          Change: 768; df: 138; p < 0.001     D1: 0.80/0.79/0.82
                                                                                          D2: 0.81/0.81/0.85
                                                                                          D3: 0.69/0.69/0.83
                                                                                          D4: 0.68/0.68/0.78
                                                                                          D5: 0.77/0.77/0.86
                                                                                          D6: 0.79/0.79/0.85
                                                                                          D7: 0.72/0.73/0.86
                                                                                          D8: 0.71/0.72/0.87
As with the UDRS model, for the MDRS model all item step parameters (100%) are above the acceptable range, with highly significant t values. In the MDPC model there are also nine step parameters (6%) that are greater than 1.33, with t values statistically significant at the 5% level (Table 6); all except one of these are at the first step. As this percentage is very close to the 5% expected by chance, further investigation is unnecessary. Looking at the item level, residuals for two items in the MDRS model vary more than expected, with t values that are highly significant: Item 4 has a weighted mean square of 1.45 with a t value of 15.7, and Item 8 has a weighted mean square of 1.37 with a t value of 12.9. Item 4 also misfits under the MDPC model (MNSQ = 1.35, t = 12.2). These findings suggest that for the multidimensional approach to the data, as with the composite approach, the partial credit model provides a better fit than the rating scale model.
Table 6
Rating Scale vs. Partial Credit Multidimensional Models: Item Fit and Step Misfit (in percentages)

Model   % Smaller Variation   % Items with Reasonable Fit   % Higher Variation   % Response Category Parameters with
        (<0.75)               (0.75 < MNSQ < 1.33)          (>1.33)              Misfitting Steps and Significant t Values
MDRS    0                     92                            8                    100
MDPC    0                     96                            4                    6
Consecutive approach. In the consecutive approach, the eight LEQ subscales comprise three items each, and each subscale is conceptualized as a stand-alone, unidimensional scale to which the rating scale and partial credit models are applied. For the UDRS model, 10 parameters were estimated: three item difficulty parameters, six step parameters, and one population variance (population means are constrained to zero). The UDPC model estimated 22 parameters: three item difficulty parameters, six step parameters for each of the three items, and one population variance (Table 7).

Table 7
Fit Criteria for UDRS vs. UDPC Models Using the Consecutive Approach

Dimension                  Model   Final Deviance   Final Estimated Parameters   Chi-square Test v. Previous Model     Person Separation Reliability (MLE/WLE/EAP)
Time Management            UDRS    34,463           10                           N/A                                   0.83/0.83/0.84
                           UDPC    34,405           22                           Change: 58; df: 12; p < 0.001         0.83/0.83/0.84
Social Competence          UDRS    31,554           10                           N/A                                   0.84/0.84/0.85
                           UDPC    31,529           22                           Change: 25; df: 12; p < 0.025         0.84/0.84/0.85
Achievement Motivation     UDRS    29,072           10                           N/A                                   0.74/0.75/0.79
                           UDPC    29,058           22                           Change: 14; df: 12; not significant   0.74/0.75/0.80
Intellectual Flexibility   UDRS    29,977           10                           N/A                                   0.73/0.73/0.77
                           UDPC    29,906           22                           Change: 71; df: 12; p < 0.001         0.73/0.73/0.77
Task Leadership            UDRS    33,309           10                           N/A                                   0.81/0.81/0.82
                           UDPC    33,259           22                           Change: 50; df: 12; p < 0.001         0.81/0.81/0.82
Emotional Control          UDRS    31,772           10                           N/A                                   0.84/0.85/0.86
                           UDPC    31,685           22                           Change: 87; df: 12; p < 0.001         0.84/0.85/0.86
Active Initiative          UDRS    29,948           10                           N/A                                   0.78/0.79/0.85
                           UDPC    29,923           22                           Change: 25; df: 12; p < 0.025         0.78/0.79/0.85
Self Confidence            UDRS    30,924           10                           N/A                                   0.76/0.77/0.82
                           UDPC    30,749           22                           Change: 175; df: 12; p < 0.001        0.76/0.77/0.82
When using the consecutive approach, the partial credit model fit the data significantly better than the rating scale model at p<0.001 for five of the eight subscales (Time Management, Intellectual Flexibility, Task Leadership, Emotional Control, Self Confidence) and at p<0.05 for two of the remaining subscales (Social Competence, Active Initiative). However, for the Achievement Motivation (AM) subscale, there was no statistically significant difference between the partial credit and rating scale models. Thus, AM is the only subscale for which there is evidence that the assumption holds that the numbers in the response alternatives are equally distant (Table 7). In all eight subscales, MNSQ item values show adequate fit for both the UDRS and UDPC models (Table 8).
However, in the UDRS model the first step parameter of each subscale fit poorly, with a MNSQ value >1.33 and highly significant t-values, whereas all step parameters fit adequately for the UDPC model.

Table 8
Rating Scale vs. Partial Credit Unidimensional Models for the Consecutive Approach

Model / Dimension            % Smaller       % Items with           % Higher        % Response Category Parameters with
                             Variation       Reasonable Fit         Variation       Misfitting Steps and Significant t-Values
                             (MNSQ < 0.75)   (0.75 < MNSQ < 1.33)   (MNSQ > 1.33)
UDRS
  Time Management            0               100                    0               25 (first and last steps)
  Social Competence          0               100                    0               13 (first step)
  Achievement Motivation     0               100                    0               13 (first step)
  Intellectual Flexibility   0               100                    0               13 (first step)
  Task Leadership            0               100                    0               13 (first step)
  Emotional Control          0               100                    0               13 (first step)
  Active Initiative          0               100                    0               13 (first step)
  Self Confidence            0               100                    0               13 (first step)
UDPC
  Time Management            0               100                    0               0
  Social Competence          0               100                    0               0
  Achievement Motivation     0               100                    0               0
  Intellectual Flexibility   0               100                    0               0
  Task Leadership            0               100                    0               0
  Emotional Control          0               100                    0               0
  Active Initiative          0               100                    0               0
  Self Confidence            0               100                    0               0
The likelihood ratio test and item and step parameter fit tests suggest that the partial credit model fits better than the rating scale model for seven of the eight LEQ subscales, but the advantage is less clear for the AM subscale. Given that there is some evidence of step misfit for the rating scale model, and little evidence that there would be a large effect size for use of either the partial credit or rating scale model, there is a slight preference for choosing the partial credit model from both a statistical standpoint and the vantage of analytical practicality (i.e., the same model can therefore be applied to all subscales). Additionally, and more importantly, findings in the following section suggest that the partial credit model is substantively a better fitting model. All analyses in the following results therefore apply the partial credit model to each of the three different model approaches to analyzing the LEQ data.
Comparing three model approaches. When comparing the strengths and weaknesses of different measurement models, issues of concern include statistical tests of fit as well as considerations about interpretability, reliability, and burden to respondents and programs. For any given instrument, simply adding more items can usually raise reliability. However, the trade-off of adding items to a scale is the cost of developing additional items, the extra time and cognitive effort it takes for respondents to answer items, and the extra time it takes for programs to administer the scale. Moreover, error may be introduced if so many items are presented that respondents react by satisficing (i.e., picking any answer or response category, even if it is inaccurate) or by leaving items unanswered. Thus, the applied use of the three models is considered below in addition to statistical tests of fit.
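The reliability-versus-length trade-off described here is often quantified with the Spearman-Brown prophecy formula. A minimal sketch follows; the inputs are illustrative, not LEQ estimates.

    def spearman_brown(reliability, length_factor):
        """Projected reliability when a scale is lengthened (or shortened)
        by length_factor, assuming the added items are parallel."""
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    # Illustrative only: doubling a 3-item subscale with reliability 0.78 to 6 items
    print(round(spearman_brown(0.78, 2), 2))   # 0.88: more items buy reliability at a cost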
Composite vs. multidimensional models. The composite approach in this analysis uses
the partial credit unidimensional model and the multidimensional approach uses the partial credit multidimensional model. The difference in deviance between the two models is 14,032, with df=35 (Table 9). When compared with the critical value for a chi-‐square distribution at p<0.001 (which is 66.6), the difference in deviance is highly statistically significant, suggesting that the multidimensional model fits the data much better than the composite model.
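The critical values quoted here, and in the consecutive comparison below, can be checked with any chi-square quantile function. A quick sketch, assuming scipy is available:

    from scipy.stats import chi2

    # Upper-tail critical values at p < 0.001 for the two model comparisons:
    print(round(chi2.ppf(0.999, 35), 1))   # ~66.6 (composite vs. multidimensional)
    print(round(chi2.ppf(0.999, 28), 1))   # ~56.9 (consecutive vs. multidimensional)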
Table 9
Comparing Model Fit Criteria for Three Analyses of LEQ Data

Approach/Model                          No. of Parameters   Final Deviance
Composite (UDPC)                        169                 257,235
Consecutive (UDPC, 8 models combined)   176                 250,514
Multidimensional (MDPC)                 204                 243,203
The composite model aggregates 24 items from all eight subscales into one scale and produces a single score with one corresponding reliability estimate and standard error. The reliability of the UDPC using EAP scores (Mislevy, Beaton, Kaplan & Sheehan, 1992) was high at 0.92. Reliability coefficients for the multidimensional model rely on only three items per
subscale, and range across the eight subscales from 0.78 for the Intellectual Flexibility subscale to 0.87 for the Self Confidence subscale (Table 10).
Table 10
Comparison of Reliabilities for Three Models of LEQ Data

Dimension                          Consecutive Approach   Multidimensional Approach   Composite Approach
                                   (MLE/WLE/EAP)          (MLE/WLE/EAP)               (MLE/WLE/EAP)
Time Management                    0.83/0.83/0.84         0.80/0.79/0.82              NA
Social Competence                  0.84/0.84/0.85         0.81/0.81/0.85              NA
Achievement Motivation             0.74/0.75/0.80         0.69/0.69/0.83              NA
Intellectual Flexibility           0.73/0.73/0.77         0.68/0.68/0.78              NA
Task Leadership                    0.81/0.81/0.82         0.77/0.77/0.86              NA
Emotional Control                  0.84/0.85/0.86         0.79/0.79/0.85              NA
Active Initiative                  0.78/0.79/0.85         0.72/0.73/0.86              NA
Self Confidence                    0.76/0.77/0.82         0.71/0.72/0.87              NA
Life Effectiveness (whole scale)   NA                     NA                          0.93/0.93/0.92

NA = not applicable
A multidimensional model can often offer a useful compromise in reliability, because correlational information from responses to items in other scales reduces measurement error due to randomness of responses. In this case, the moderate dimensional correlations in the MDPC model (Table 11) have a surprisingly small effect on reliabilities (Table 10). The reliability of the composite UDPC model is considerably higher than that of the MDPC model, but the reliabilities of the MDPC model are well within a typically acceptable range (Nunnally & Bernstein, 1994).

Table 11
MDPC Correlation Matrix

Dimension                       TM1    SO2    AM3    IF4    TL5    EC6    AI7    SC8
TM1: Time Management            1.0
SO2: Social Competence          0.49   1.0
AM3: Achievement Motivation     0.73   0.51   1.0
IF4: Intellectual Flexibility   0.50   0.49   0.65   1.0
TL5: Task Leadership            0.59   0.76   0.61   0.57   1.0
EC6: Emotional Control          0.50   0.51   0.48   0.62   0.65   1.0
AI7: Active Initiative          0.61   0.57   0.68   0.59   0.66   0.56   1.0
SC8: Self Confidence            0.61   0.61   0.70   0.64   0.75   0.68   0.76   1.0
Interestingly, both item and step fit are roughly equivalent for the UDPC and MDPC models, with no items or steps showing misfit for the UDPC model, but one item (Item 4) and nine steps (6%) showing misfit for the MDPC model. Since this value is very close to the 5% that is expected by chance, it is unnecessary to investigate these instances of misfit further. Nonetheless, if we do look at item residuals for both models more carefully, we see, for instance, that Item 4 is just above the acceptable weighted mean square fit range for the MDPC model (MNSQ=1.35, t=12.2), and just barely within the acceptable fit range for the UDPC model (MNSQ=1.32, t=10.9). Practically speaking, this is a negligible difference. The residual information for Item 4 shows that, rather than substantiating the UDPC model as the better model, Item 4 is performing differently than the other items in this instrument and should be flagged for further examination regardless of the model that is chosen. A quick review of Item 4 (“I change my thinking or opinions easily if there is a better idea”) shows that the item is double-barreled and also requires respondents to consider a qualifier (i.e., “easily”) as they think about their response to the presented scenario. The item would likely fit better under both models with a slight redesign (e.g., “I change my opinion if there is a better idea”).
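The weighted (infit) mean square statistics cited throughout can be understood as squared score residuals pooled against their model variances. A minimal sketch follows, with entirely illustrative person-level numbers rather than LEQ estimates; in practice the expected scores and variances come from the fitted model.

    def weighted_mnsq(observed, expected, variances):
        """Infit (weighted) mean square for one item. Values near 1 indicate
        fit; values above 1.33 were treated as misfit in this analysis."""
        sq_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
        return sq_residuals / sum(variances)

    # Illustrative person-level expected scores and model variances for one item.
    obs = [5, 6, 7, 4, 6]
    exp_scores = [4.1, 6.8, 6.0, 4.9, 5.3]
    model_vars = [0.9, 0.8, 0.7, 1.0, 0.8]
    print(round(weighted_mnsq(obs, exp_scores, model_vars), 2))   # ~0.89, within range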
Consecutive vs. multidimensional model. The consecutive approach in this analysis
uses the partial credit unidimensional model (UDPC) and the multidimensional approach uses the partial credit multidimensional model (MDPC). The difference in deviance between the two models is 7,310, with df=28, as shown in Table 9. When compared with the critical value for a chi-‐square distribution at p<0.001 (which is 56.9), the difference in deviance is highly statistically significant. The likelihood ratio test therefore suggests that the multidimensional model fits the data much better than the consecutive model. However, all items and all steps fit well for the UDPC model.
Consecutive models are typically disadvantaged with lower reliability coefficients than composite models, due to the smaller number of items per calibration. Surprisingly, the consecutive reliabilities of the LEQ for the UDPC model compare favorably with the multidimensional model (Table 10). A similar finding of high subscale reliabilities occurred during a multidimensional and consecutive analysis of the internationally popular SF-36 Health Survey (Wilson, 2005b). The SF-36 Health Survey and the LEQ are both comprised of eight subscales. The dimensions and items in both instruments share a further similarity in that they were culled from multiple iterations of instrument development and concurrent factor analyses in which high reliability coefficients were prioritized (Wilson, 2011, personal conversation).
Wright map person-‐item distribution. A comparison of the Wright map for items
(Figure 6) and a Wright map for item thresholds for polytomous data (Figure 7) shows that the transitions between response categories do a better job of mapping out the structure of the LE scale than do the items. Results are examined first for the composite LEQ then for the subscales.
Composite LEQ Wright map. The location of item thresholds between categories 1 and
2, and 2 and 3 shows that these response option categories are exceptionally easy for respondents to bypass or endorse. A respondent with average life effectiveness ability (at 0
logits) has about a 90% probability of choosing a response level above 2. The Wright map for the composite LEQ (Figure 6) indicates that person abilities range from about -3.0 logits to 2 logits (a five logit spread), but item difficulties range only from about -1.5 logits (Item 12) to -0.25 logits (Item 9). When looking at the Wright map of item thresholds (Figure 7), though, there is a better match between person and item threshold locations for the composite LEQ.
It is important that the range of item difficulties matches the range of person abilities on the Wright map, because item response models are most sensitive for respondents whose location on the scale is close to that of items. While there are no gaps for the composite LEQ, the skew of the thresholds is pronounced; more than 75% of the item thresholds are located below 0 logits (population mean estimates are constrained to 0 logits), and more than a third of the item thresholds are located at or below -1.7 logits, even though less than 1% of the sample is located there. This means that higher levels of life effectiveness will be measured less precisely than lower levels, and the high precision at the lower levels is of little practical use, since so few respondents are located there. The standard error of measurement for person ability estimates is not constant in item response modeling, as it is with the classical test theory approach. Rather, the standard error of measurement is conditional, depending on the relationship between the location of the person estimate and its proximity to item thresholds. The more item thresholds that are close to the location of a person estimate, the more precise the estimate will be. The shape of the relationship between the standard error of measurement and location for the composite LEQ can be seen in Figure 11. Each circle in the figure depicts a raw score and its associated logit value. The standard errors of measurement for the composite LEQ are low across most respondent ability locations, which is a strength of the composite approach in general.
However, standard errors are larger for respondents with relatively higher levels of life effectiveness than for those with lower levels. This is problematic for interpreting change scores for the half of the respondents who begin an intervention with moderate or high life effectiveness. A respondent with a high (although not extreme) total raw score of 152 on the composite LEQ has a logit estimate of 1.45 with a standard error of 0.281. The 95% confidence interval (CI) is (0.934, 2.04), a range of 1.1 logits, which is about 25% of the width of the entire Wright map for respondent locations. So for a person who scored 152, the plausible range of raw scores is anywhere between 144 and 158, a 14-point range. A respondent with a relatively low total raw score of 72 on the composite LEQ has a logit estimate of -1.23 with a standard error of 0.142. The 95% CI is (-1.51, -0.95), a range of 0.55 logits, or about half the range of a high-scoring respondent. This 95% CI translates to a raw score range of 63 to 90, or 27 points. Estimates of item parameters and their corresponding standard errors for the composite LEQ are available in Appendix C.
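The conditional standard error described here can be computed directly from the partial credit model: for Rasch-family models, an item's information at a given theta is the conditional variance of its score, and the SEM is the reciprocal square root of the summed test information. The sketch below uses hypothetical step parameters (not the estimated LEQ values) skewed toward the low end, to mimic the pattern in Figures 11 and 14.

    import numpy as np

    def pcm_category_probs(theta, deltas):
        """Partial credit model category probabilities for one item; deltas
        are the step parameters, and category 0 contributes no step."""
        cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
        ex = np.exp(cum - cum.max())          # subtract max to avoid overflow
        return ex / ex.sum()

    def conditional_sem(theta, item_deltas):
        """SEM(theta) = 1 / sqrt(test information), where each item's
        information is the conditional variance of its category score."""
        info = 0.0
        for deltas in item_deltas:
            p = pcm_category_probs(theta, deltas)
            k = np.arange(len(p))
            info += (p * k**2).sum() - (p * k).sum() ** 2
        return 1.0 / np.sqrt(info)

    # Hypothetical step parameters for three 8-category items, skewed low.
    items = [[-5.2, -4.5, -3.0, -1.8, -0.4, 0.2, 1.9]] * 3
    print(conditional_sem(-2.0, items), conditional_sem(1.5, items))
    # the SEM is smaller where thresholds cluster (low theta) than at high theta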
Although it seems counterintuitive that a smaller confidence interval in the logit metric corresponds to a larger spread in raw scores for persons of lower life effectiveness, this is due to the nonlinear mapping between Rasch scaled scores and raw scores. Working from the confidence intervals above, a 0.25 logit difference near the high end of the LEQ scale is equivalent to roughly 3 raw score points, while a 0.25 logit difference near the low end is equivalent to roughly 12 raw score points. This nonlinear mapping stretches the logit scale at the extremes of the raw score range, which is how the Rasch model maps observed scores bounded between zero and the maximum possible score onto abilities that range from negative to positive infinity.
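The same machinery illustrates the nonlinear mapping. Reusing pcm_category_probs and the hypothetical items list from the previous sketch, the test characteristic curve gives the model-expected raw score at any theta, so equal logit steps can be translated into (unequal) raw score changes.

    def expected_raw_score(theta, item_deltas):
        """Test characteristic curve: the model-expected total raw score at
        theta, summed over items (expected category score per item)."""
        total = 0.0
        for deltas in item_deltas:
            p = pcm_category_probs(theta, deltas)
            total += (p * np.arange(len(p))).sum()
        return total

    # With these hypothetical, low-skewed parameters, a fixed 0.25 logit step
    # spans more raw score points at the low end of the scale than at the high end.
    for lo in (-1.25, 1.25):
        print(expected_raw_score(lo + 0.25, items) - expected_raw_score(lo, items))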
Figure 11. The standard error of measurement for the composite LEQ scale (each circle represents a different raw score by its associated logit).
The effect size (i.e., interpretive meaning) of the fit statistics for a particular
measurement model can be substantively assessed by looking at the relative pattern of thresholds within and across items, as illustrated in the Wright map for the UDPC model (Figure 12). Figure 12 contains the same information as Figure 7, but the reformatted layout helps to visually conceptualize and interpret the patterns of responses. Item threshold levels that are horizontally distinct within subdomains indicate clear differentiation among item responses within that subdomain; distinct item threshold levels across the entire instrument horizontally would indicate differentiation of levels across the broader life effectiveness construct.
Figure 12. Wright map for composite LEQ of item thresholds by subdomain. Each ‘X’ represents 23.0 cases.
Evidence that respondents did not interpret the categories equivalently, even at the extremes, is clear from the Wright map. For example, the line in Figure 12 traces the location of the transition from the fourth to the fifth threshold (i.e., the point at which answering in the fifth category is as likely as answering in categories 1-4). This transition ranges across the entire lower half of the range of respondent locations, and the same pattern holds even for the transitions at the extremes. Thus, the scaling assumption of the LEQ's Likert-style response categories (namely, that respondents interpret the response options equivalently across items, and that the data are therefore of interval or ratio nature, a necessary condition to justify the use of averaged scores) is clearly not supported in this case.
Evidence that respondents did not respond in a consistent, uniform pattern to response categories, despite being offered a uniform set of response options, also has implications for use of the data. For instance, there is a 0.7 logit spread between the first and second thresholds of Item 3 (i.e., 3.1 to 3.2) in the AM subdomain, but only a 0.1 logit spread between the first and second thresholds of Item 12 (i.e., 12.1 to 12.2) in the IF subdomain. This means that it is approximately seven times more difficult (in the logit metric) for respondents to move up from the lowest level of achievement motivation than to move up from the lowest level of intellectual flexibility. This could imply that programs would need to expend considerably more resources to positively affect students at the lowest levels of achievement motivation than at the lowest levels of intellectual flexibility.
Even within a single subdomain, the pattern of thresholds across items is not always stable. For instance, thresholds do not align well within the self-‐confidence subdomain. These patterns imply that interpretation of the data would be qualitatively and substantively different between items, even within a subdomain. An example of good alignment of thresholds across items within a single subdomain, with no overlapping of threshold levels, can be seen in the AI subdomain. However, within each item of the AI subdomain the pattern between thresholds is not uniform. There is about a .3 logit spread between the third and fourth thresholds of AI items (7.3 to 7.4; 15.3 to 15.4; 23.3 to 23.4), but there is about a 1 logit spread between the sixth and seventh thresholds (7.6 to 7.7; 15.6 to 15.7; 23.6 to 23.7).
These patterns indicate that respondents require approximately three times the amount of active initiative (in the logit metric) to move from the sixth to the seventh threshold of AI as they do to move from the third to the fourth threshold. This can be interpreted to mean that programs will have to expend more resources to help increase active initiative levels for participants who begin programs with relatively high amounts of active initiative than for participants with moderately low amounts of active initiative.
These examples of variations in response patterns illustrate why, both statistically and substantively, the various fit test results for the composite approach show that the partial credit model provides a better fit for the data than the rating scale model, and that computing and interpreting summed and averaged LEQ scores is problematic.
LEQ subscale Wright maps. As noted in the analysis of the composite LEQ, gaps in coverage between persons and items indicate areas of the construct where respondents are being measured with less precision than in other parts of the scales. Unlike the composite LEQ, which had no gaps, all of the subscales exhibited meaningful gaps or mismatches in coverage. Additionally, because there are only three items for each subscale, even without complete gaps the item thresholds are spread more sparsely across the high end of the construct.
As an illustrative example, consider the Wright Map for AI (Figure 13), which shows a fairly typical distribution pattern among the subscales. (Wright Maps of item parameter estimates and item thresholds for all subdomains can be found in Appendix B.) Figure 13 shows that the range of item thresholds does an adequate job of matching person locations overall, but with some important limitations. Most importantly, there are no item thresholds (i.e., there are gaps in thresholds) for persons with relatively high amounts of AI. The entire top half of the respondent distribution is matched by only four of the twenty-‐one possible thresholds. There are many more item thresholds for persons with low amounts of AI, but there are no respondents located near the lowest six thresholds.
[ConQuest output: Active Initiative map of WLE estimates and response model parameter estimates. The map plots cases and raw scores against item thresholds on the logit scale: thresholds 1.1, 2.1, and 3.1 sit near -5.5 logits; 1.2, 2.2, and 3.2 near -4.5; 1.3, 2.3, and 3.3 near -4; 1.4, 2.4, and 3.4 near -3; 1.5, 2.5, and 3.5 near -2; 1.6, 2.6, and 3.6 near 0; and 1.7, 2.7, and 3.7 near 1.5. Illustrative 95% confidence intervals are drawn for raw scores of 13, 19, and 21. Each 'X' represents 48.7 cases.]
Figure 13. Wright map of item thresholds and 95% confidence intervals for the AI subscale.
The graph in Figure 14 shows that estimates are most precise at very low levels of AI and much less precise at moderate and high levels of AI. This pattern of threshold distribution means that a respondent in the middle of the distribution, with an ability estimate of 0 logits (corresponding with a raw score of 17), has a standard error of 0.77 and a 95% confidence interval of (-1.51, 1.53). This confidence interval spans three logits, or about a third of the full range of respondent locations on the AI scale. This range gives us more information than if there were no data on the respondent, but it also indicates that there is considerable uncertainty in that person's ability estimate. Examples of 95% confidence intervals for persons with AI raw scores of 13, 19, and 21 are indicated in Figure 13 (see footnote 4).
A similar pattern of mismatch between respondent and item threshold locations is evident for all eight subscales. (Item parameter estimates and corresponding standard errors for all subscales are available in Appendix C.) All subscales exhibit standard errors that are moderately large to large (see footnote 5) for respondents at all ability estimates, and confidence intervals that span several logits. This uncertainty around respondent scores is not typically taken into account, but it should not be ignored. The size of the confidence intervals indicates that individual ability estimates from all of the subscales must be treated with considerable caution, especially when these estimates are used to assess change over time.
Figure 14. Standard error of measurement for the Active Initiative subscale (each diamond represents a different raw score by its associated logit).
4. With reference to the distribution of respondent ability for any given scale, recall that data were collected on the first day of a given intervention. Additionally, Neill (2008) found that average raw scores on this initial date were depressed when compared with LEQ raw scores gathered from respondents approximately one month prior to the start of the intervention.
5. Standard errors are also potentially underestimated due to the large number of response category options per item.
Average Measure Values. In addition to the fit statistics for items and item steps that are reviewed when examining overall model fit, an important source of information about the functioning of items in an instrument comes from average measure values. These values are produced in the item analysis of ConQuest output along with other traditional statistics, such as point biserials and discrimination values. Respondents who score higher on the overall construct of Life Effectiveness should generally score higher (i.e., select a higher response category) on each item. Average measure values provide the overall LEQ theta estimates, i.e., the means of the locations, for all respondents in the sample within each response category (Wilson, 2005a; Bond & Fox, 2007). All items for which this pattern does not hold should be investigated to determine the underlying cause or explanation for the inconsistency (Wilson, 2005a).
Typically, average measure values that are out of order (i.e., disordered categories) are an indication of item misfit (Linacre, 1999). Thus, the requirement for average measures to increase monotonically with category is usually regarded as essential to measurement accuracy (Linacre, 2002). One potential source of misfit is that items do not hold to the Rasch model assumption that all items measure only one trait. The estimates of person ability and item difficulty produced by the Rasch model are meaningful only if each item on the test is measuring a single attribute or latent trait (Bond & Fox, 2007). The Rasch model specifies that the probability of a person successfully responding to an item “is governed by the product of the ability of the person and the easiness of the item and nothing more” (Wright & Panchapakesan, 1969). Items that do not conform to this principle of unidimensionality will not fit the model well.
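Computationally, average measure values are simple to produce from calibrated data. A minimal sketch follows, using illustrative response and location vectors rather than the actual LEQ sample.

    import pandas as pd

    def average_measure_values(categories, thetas):
        """For one item: the mean person location (theta) among respondents
        who chose each response category. The expectation is that these
        means increase monotonically with category."""
        data = pd.DataFrame({"category": categories, "theta": thetas})
        return data.groupby("category")["theta"].mean()

    # Illustrative data only; the real values come from the LEQ calibration.
    cats = [1, 2, 2, 3, 3, 3, 4, 4]
    locs = [-1.2, -0.9, -1.1, -0.4, -0.6, -0.2, 0.3, 0.5]
    print(average_measure_values(cats, locs))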
For the composite LEQ, there are many instances in which the mean location values at the lower end of the scale are slightly out of expected order; for example, the category 2 mean for Item 1 is lower than its category 1 mean (Table 12). However, and interestingly, all LEQ items for the composite model have acceptable fit statistics. This indicates that the disordering here is not necessarily due to misfit arising out of unexpected observed variance. Rather, this category disordering may be related to the small frequency counts in the lower categories (Wilson, Allen, & Li, 2006).
Table 12
Average Measure Values for LEQ Data in Response Categories 1-8 (Mean Location Values within Categories)

Item      1       2       3       4       5       6       7       8
  1    -0.78   -0.79   -0.67   -0.45   -0.23    0.10    0.42    0.77
  2    -1.04   -0.82   -0.73   -0.53   -0.22    0.08    0.39    0.85
  3    -1.12   -0.78   -0.85   -0.68   -0.42   -0.18    0.16    0.56
  4    -0.75   -0.82   -0.56   -0.53   -0.29   -0.16    0.16    0.49
  5    -0.62   -0.70   -0.69   -0.49   -0.27    0.06    0.47    0.81
  6    -0.88   -0.79   -0.68   -0.54   -0.32   -0.02    0.25    0.68
  7    -1.23   -1.18   -0.96   -0.82   -0.49   -0.22    0.15    0.58
  8    -1.14   -1.05   -1.06   -0.75   -0.47   -0.19    0.14    0.53
  9    -0.72   -0.68   -0.58   -0.36   -0.12    0.22    0.50    0.89
 10    -1.10   -0.92   -0.77   -0.63   -0.30    0.03    0.40    0.76
 11    -0.95   -1.39   -1.08   -0.95   -0.63   -0.27    0.11    0.55
 12    -2.16   -1.02   -0.95   -0.79   -0.60   -0.23    0.11    0.52
 13    -0.91   -1.04   -0.95   -0.67   -0.38   -0.05    0.36    0.82
 14    -1.03   -0.91   -0.87   -0.66   -0.39   -0.03    0.39    0.80
 15    -1.15   -1.31   -0.89   -0.79   -0.47   -0.22    0.11    0.53
 16    -1.47   -1.11   -1.02   -0.83   -0.47   -0.16    0.28    0.69
 17    -0.80   -0.87   -0.69   -0.55   -0.21    0.14    0.51    0.98
 18    -1.24   -1.06   -0.84   -0.68   -0.39   -0.03    0.33    0.71
 19    -1.61   -0.86   -1.21   -0.94   -0.65   -0.28    0.07    0.60
 20    -1.45   -1.16   -1.02   -0.76   -0.49   -0.14    0.24    0.66
 21    -1.07   -0.94   -0.88   -0.65   -0.35    0.07    0.47    0.89
 22    -1.01   -0.82   -0.71   -0.64   -0.36   -0.04    0.36    0.74
 23    -1.27   -1.22   -1.04   -0.76   -0.51   -0.18    0.20    0.61
 24    -1.02   -1.13   -1.08   -0.91   -0.58   -0.27    0.12    0.60
Problems with the proportions of responses across categories can be seen, for example, in Table 13, which shows average measure values (PV1Avg:1) along with other traditional item statistics for Item 11 (“I try to get the best results when I do things”). The first four response categories for Item 11 contain responses from less than 5% of the sample.
Table 13
Item 11 (Achievement Motivation: “I try to get the best results when I do things.”)

Cases for this item: 3,631   Discrimination: 0.62   Weighted MNSQ: 0.93
Item Threshold(s): -2.72  -2.41  -2.05  -1.70  -1.22  -0.33  0.95
Item Delta(s):     -1.70  -2.74  -2.00  -1.83  -1.44  -0.37  0.69

Label   Score   Count   % of tot   Pt Bis   t (p)            PV1Avg:1   PV1 SD:1
0        0.00       6       0.17    -0.07    -4.30 (0.000)      -0.95       0.91
1        1.00       8       0.22    -0.11    -6.71 (0.000)      -1.39       0.60
2        2.00      38       1.05    -0.19   -11.80 (0.000)      -1.08       0.54
3        3.00     110       3.03    -0.28   -17.49 (0.000)      -0.95       0.44
4        4.00     346       9.53    -0.31   -19.85 (0.000)      -0.63       0.50
5        5.00     973      26.80    -0.20   -11.98 (0.000)      -0.27       0.50
6        6.00    1286      35.42     0.17    10.27 (0.000)       0.11       0.58
7        7.00     864      23.80     0.41    26.84 (0.000)       0.55       0.73
The disordered thresholds and the problems evidenced in the item characteristic curves for Item 11 (Figure 15) directly reflect the low frequency counts (Adams, Wu, & Wilson, 2012). The item characteristic curves in Figure 15 illustrate that the first two response categories had a very low probability of being selected by respondents at any point along the latent trait of Life Effectiveness.
Figure 15. Item characteristic curves for Item 11.
Substantive concerns about disordered average measure values in the composite LEQ can likely be disregarded due to the small samples at the extremes. Nonetheless, while the average measure values look good overall, it would be worth exploring whether the underlying issues that are evident with disordered categories and disordered thresholds could be resolved by collapsing some of the categories and rechecking the item statistics to see if functioning is improved (Wright & Linacre, 1992). Collapsing categories would also address potential problems with estimate stability and precision that can arise with low frequency counts in categories (see footnote 6) (Linacre, 2002). It is important to note that since the overall sample size here is large (n=3,634), the low frequency counts at the extremes are likely the result of over-categorization (Linacre, 1999). Over-categorization occurs when there are too many response category options for respondents to meaningfully differentiate among them, as appears to be the case with the LEQ. It is well known that scales with more items and more categories are more likely to attain higher reliability (Weng, 2004). Each response category acts, in effect, like a binary item, so adding more response categories instead of items increases the reliability of the instrument. The problem with over-categorization is that it can lead to artificially reduced standard errors and inflated reliabilities (Linacre, 2012). When high reliabilities are a statistical artifact of over-categorization, this weakens the validity argument based on internal structure. The creation of artificially reduced standard errors and inflated reliabilities is a distinct possibility with the LEQ, particularly for the LEQ subscales, which have unexpectedly high reliabilities (i.e., ranging from 0.73 to 0.85) for short, individual 3-item subscales.

6. Categories were not collapsed for this paper, because the analyses here concern appraisal of the validity evidence for the instrument as it is currently developed and used.
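The category-collapsing step just described amounts to a simple recode before re-estimation. A minimal sketch follows; the particular mapping (merging the four sparse low categories into one) is illustrative only, not a recommendation for specific LEQ categories.

    import numpy as np

    # Merge sparse categories 1-4 of an 8-category item into a single code,
    # then renumber the remaining categories; item statistics would be
    # rechecked after refitting the model on the recoded responses.
    collapse_map = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5}

    responses = np.array([1, 3, 5, 6, 7, 8, 4, 6])
    collapsed = np.vectorize(collapse_map.get)(responses)
    print(collapsed)   # [1 1 2 3 4 5 1 3]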
Differential item functioning. Results from DIF analyses for three groups of interest (gender, age, voluntary status) are presented first for the composite LEQ, followed by results for the subscales. Findings showed no evidence of intermediate or large DIF for the composite scale or for seven of the eight subscales. Only Item 1 of the Task Leadership subscale showed evidence of intermediate DIF, by age and by voluntary status, and a significant finding of this kind could be expected by chance. Nonetheless, Item TL1 is reviewed and implications for retaining or removing the item are discussed. For all sets of items, there is no meaningful evidence of construct-irrelevant variance that would threaten the valid interpretation and use of scores.
Composite LEQ DIF. An omnibus test of parameter equality was used for the composite
LEQ to test for gender, age, and voluntary status DIF (see Appendix D for composite DIF tables for each group). The group by item interaction was statistically significant for each of the group categories. Multiple items within each group analysis showed statistically significant DIF, but the DIF effect sizes for all items in the composite LEQ were negligible. The TL1 item for the voluntary and age groups had the highest effect size of all statistically significant items (0.388 for voluntary and 0.402 for respondents younger than 19 when compared with respondents older than 24), again highlighting potential problems with this item.
Subscales: Gender DIF. An omnibus test of parameter equality was used for each of the eight LEQ subscales to test for gender DIF (N=male 2161, female 1452). The gender by item interaction resulted in statistically significant differences between the models for five subscales (Table 14).
Table 14
Gender DIF by LEQ Subscale (Subscale x Gender)

LEQ-H Subscale and Model                  Deviance   Final Estimated   Chi-square Test v. Previous       Sig. at
                                                     Parameters        Model: Change in Parameters; df   p < .05
Time Management
  item-gender+item*step                   34,169     23
  item-gender+item*gender+item*step       34,165     25                change: 4; df: 2                  no
Social Competence
  item-gender+item*step                   31,240     23
  item-gender+item*gender+item*step       31,226     25                change: 14; df: 2                 yes
Achievement Motivation
  item-gender+item*step                   28,857.1   23
  item-gender+item*gender+item*step       28,852.2   25                change: 4.91; df: 2               no
Intellectual Flexibility
  item-gender+item*step                   29,694     23
  item-gender+item*gender+item*step       29,684     25                change: 10; df: 2                 yes
Task Leadership
  item-gender+item*step                   33,038     23
  item-gender+item*gender+item*step       33,029     25                change: 9; df: 2                  yes
Emotional Control
  item-gender+item*step                   31,345.9   23
  item-gender+item*gender+item*step       31,341.5   25                change: 4.46; df: 2               no
Active Initiative
  item-gender+item*step                   29,744     23
  item-gender+item*gender+item*step       29,720     25                change: 24; df: 2                 yes
Self Confidence
  item-gender+item*step                   30,520     23
  item-gender+item*gender+item*step       30,480     25                change: 40; df: 2                 yes
Eight items across the five subscales showed DIF when the absolute value of the DIF parameter exceeded two times its standard error. However, the magnitude of the DIF was negligible in all cases (Table 15). It is the magnitude of the DIF that determines its practical effect, or substantive importance. For instance, the first Social Competence item (“I am successful in social situations”) is more difficult for females than males, but the difference estimate is only 0.08 logits (p=.05). If all three Social Competence items had DIF of this magnitude, it would shift the overall male ability distribution higher by 8% of a standard deviation, which is a very small effect. With only one item in this subscale having DIF, the effect is even smaller.
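The flagging and grading rules used in Tables 15 through 19 are easy to express in code. A minimal sketch follows, assuming (as the tables do) that the reported effect size is twice the absolute DIF estimate, because the two group effects are constrained to sum to zero.

    def classify_dif(dif_par, se):
        """Flag DIF when |estimate| exceeds twice its standard error, then
        grade the effect size against the cutoffs in the table notes:
        negligible below .426, intermediate below .638, large otherwise."""
        significant = abs(dif_par) > 2 * se
        effect = 2 * abs(dif_par)
        if effect < 0.426:
            magnitude = "negligible"
        elif effect < 0.638:
            magnitude = "intermediate"
        else:
            magnitude = "large"
        return significant, effect, magnitude

    # TskLdWORK under voluntary status (Table 17): intermediate DIF
    print(classify_dif(0.286, 0.022))   # (True, 0.572, 'intermediate')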
Table 15
Gender DIF by LEQ Item per Subscale

Subscale                   Item          Gender   DIFpar    SE            Sig   Effect Size   Magnitude*
Social Competence          SocCompSUCC   male     -0.04     0.018         yes   0.08          negligible
                           SocCompCOMP   male     -0.03     0.018         no    na            na
                           SocCompCOMM   male      0.069    constrained         0.138         negligible
Intellectual Flexibility   IntFlexCHAN   male     -0.05     0.017         yes   0.102         negligible
                           IntFlexNEW    male      0.047    0.018         yes   0.094         negligible
                           IntFlexADAP   male      0.004    constrained         0.008         negligible
Task Leadership            TskLdWORK     male     -0.05     0.016         yes   0.094         negligible
                           TskLdLEADE    male      0.017    0.016         no    0.034         negligible
                           TskLdMOTIV    male      0.03     constrained         0.06          negligible
Active Initiative          ActIniBUSY    male      0.091    0.019         yes   0.182         negligible
                           ActIniENRG    male     -0.06     0.019         yes   0.114         negligible
                           ActIniGETT    male     -0.03     constrained         0.068         negligible
Self Confidence            SelfConABI    male      0.079    0.017         yes   0.158         negligible
                           SelfConAPP    male     -0.11     0.018         yes   0.216         negligible
                           SelfConBELI   male      0.029    constrained         0.058         negligible

*Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES < .638; large = ES ≥ .638.
Subscales: Voluntary status DIF. DIF analysis for voluntary versus non-voluntary enrollment status resulted in a significant group by item interaction for five LEQ subscales (Table 16). Of the 14 items showing DIF, the magnitude for 13 of these was negligible (Table 17). One item in the Task Leadership subscale (TL Item 1: “I can get people to work for me”) showed a magnitude of “intermediate.” When controlling for all participants' task leadership ability on the other two items in the subscale, non-voluntarily enrolled participants found this item significantly more difficult to agree with, or more difficult to rate themselves
highly on, than did voluntarily enrolled participants. If all items in the Task Leadership subscale exhibited the same behavior, the estimated mean score for voluntary participants would be 0.572 logits higher than that of non-‐voluntary participants, which is more than 50% of a standard deviation. This item is therefore worth flagging for further review.
Table 16
Voluntary Status DIF by LEQ Subscale

LEQ-H Subscale and Model                        Deviance   Parameters   Chi-square Test v. Previous       Sig. at
                                                                        Model: Change in Parameters; df   p < .05
Time Management
  item-voluntary+item*step                      31,897     23
  item-voluntary+item*voluntary+item*step       31,874     25           change: 23; df: 2                 yes
Social Competence
  item-voluntary+item*step                      29,137     23
  item-voluntary+item*voluntary+item*step       29,109     25           change: 28; df: 2                 yes
Achievement Motivation
  item-voluntary+item*step                      26,981     23
  item-voluntary+item*voluntary+item*step       26,977     25           change: 4; df: 2                  no
Intellectual Flexibility
  item-voluntary+item*step                      27,733     23
  item-voluntary+item*voluntary+item*step       27,733     25           change: 0; df: 2                  no
Task Leadership
  item-voluntary+item*step                      30,844     23
  item-voluntary+item*voluntary+item*step       30,673     25           change: 171; df: 2                yes
Emotional Control
  item-voluntary+item*step                      29,337     23
  item-voluntary+item*voluntary+item*step       29,327     25           change: 10; df: 2                 yes
Active Initiative
  item-voluntary+item*step                      27,663     23
  item-voluntary+item*voluntary+item*step       27,661     25           change: 2; df: 2                  no
Self Confidence
  item-voluntary+item*step                      28,525     23
  item-voluntary+item*voluntary+item*step       28,494     25           change: 31; df: 2                 yes
Table 17
Voluntary Status DIF by LEQ Item per Subscale

Subscale            Item          DIFpar    SE            Sig   Effect Size   Magnitude*
Time Management     TimeMgtPLAN   -0.064    0.022         yes   0.128         negligible
                    TimeMgtWAST    0.102    0.021         yes   0.204         negligible
                    TimeMgtMANA   -0.038    constrained         0.076         negligible
Social Competence   SocCompSUCC   -0.019    0.026         no    na            na
                    SocCompCOMP    0.129    0.026         yes   0.258         negligible
                    SocCompCOMM   -0.11     constrained         0.22          negligible
Task Leadership     TskLdWORK      0.286    0.022         yes   0.572         intermediate
                    TskLdLEADER   -0.191    0.023         yes   0.382         negligible
                    TskLdMOTIVA   -0.095    constrained         0.19          negligible
Emotional Control   EmotContSTR    0.055    0.025         yes   0.11          negligible
                    EmotContNEW   -0.079    0.025         yes   0.158         negligible
                    EmotContWRO    0.024    constrained         0.048         negligible
Self-Confidence     SelfConABI     0.102    0.024         yes   0.204         negligible
                    SelfConAPP     0.026    0.025         no    0.052         negligible
                    SelfConBELI   -0.128    constrained         0.256         negligible

*Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES < .638; large = ES ≥ .638.

In a review of item TL1, there is no obvious reason why this item is more difficult for
involuntary participants compared with voluntary participants. However, the item subtly differs from the other two task leadership items in that it asks respondents about their ability to get other people to do something (work) for them, rather than about what the respondent is able to do themselves (lead and motivate others) when a task needs to be done. In the problem item, the focus is more on the respondent than on the tasks, and there is a slight sense that the item taps a respondent's ability to manipulate people into doing work for them, rather than their ability to lead other people in specific tasks. The item may thus tap subtle relationship dynamics that differ for involuntary versus voluntary program participants. Factors influencing the decision about whether to retain this item are discussed below.
Subscales: Age DIF. DIF analysis for age resulted in a significant group by item
interaction for five of the eight LEQ subscales (Table 18). There were 24 items across the five subscales that showed DIF, 23 of which were of negligible effect size (Table 19). However, the same Task Leadership item (TL Item 1: “I can get people to work for me”) that showed intermediate DIF for the involuntary respondents also showed intermediate DIF for the “Under 19” respondents (Table 19). In this case, participants who were younger than 19 years old found it significantly more difficult to agree with this item, or to rate themselves highly on the item, than did participants aged 25 years or older, when controlling for their task leadership ability on the other two items in the subscale. The difference estimate for this item is 0.468 logits between the two age groups, which is close to 50% of a standard deviation among respondents.
Table 18
Age Category DIF by LEQ Subscale

LEQ-H Subscale and Model                      Deviance   Parameters   Chi-square Test v. Previous       Sig. at
                                                                      Model: Change in Parameters; df   p < .05
Time Management
  item-agecat3+item*step                      34,270     24
  item-agecat3+item*agecat3+item*step         34,265     28           change: 5; df: 4                  no
Social Competence
  item-agecat3+item*step                      31,371     24
  item-agecat3+item*agecat3+item*step         31,356     28           change: 15; df: 4                 yes
Achievement Motivation
  item-agecat3+item*step                      28,945     24
  item-agecat3+item*agecat3+item*step         28,931     28           change: 14; df: 4                 yes
Intellectual Flexibility
  item-agecat3+item*step                      29,790     24
  item-agecat3+item*agecat3+item*step         29,784     28           change: 6; df: 4                  no
Task Leadership
  item-agecat3+item*step                      33,111     24
  item-agecat3+item*agecat3+item*step         32,950     28           change: 161; df: 4                yes
Emotional Control
  item-agecat3+item*step                      31,523     24
  item-agecat3+item*agecat3+item*step         31,516     28           change: 7; df: 4                  no
Active Initiative
  item-agecat3+item*step                      29,783     24
  item-agecat3+item*agecat3+item*step         29,765     28           change: 18; df: 4                 yes
Self-Confidence
  item-agecat3+item*step                      30,602     24
  item-agecat3+item*agecat3+item*step         30,567     28           change: 35; df: 4                 yes
Table 19
Age Category DIF for LEQ Items by Subscale

Subscale / Item   Age     DIFpar    SE      Sig   Effect Size 1*   Magnitude***   Effect Size 2**   Magnitude***

Social Competence
SocCompSUCC       <19    -0.051    0.018    yes   0.046            negligible     0.107             negligible
SocCompSUCC       19-24  -0.005    0.02     no
SocCompSUCC       >24     0.056    0.027    yes   0.107            negligible     0.061             negligible
SocCompCOMP       <19     0.078    0.018    yes   0.132            negligible     0.102             negligible
SocCompCOMP       19-24  -0.054    0.02     yes   0.132            negligible     0.03              negligible
SocCompCOMP       >24    -0.024    0.027    no
SocCompCOMM       <19    -0.027    0.026    no
SocCompCOMM       19-24   0.059    0.028    yes   0.086            negligible     0.091             negligible
SocCompCOMM       >24    -0.032    0.039    no

Achievement Motivation
AchMotDETAI       <19     0.076    0.019    yes   0.086            negligible     0.143             negligible
AchMotDETAI       19-24  -0.01     0.02     no
AchMotDETAI       >24    -0.067    0.028    yes   0.143            negligible     0.057             negligible
AchMotRESUL       <19    -0.017    0.019    no
AchMotRESUL       19-24  -0.026    0.021    no
AchMotRESUL       >24     0.043    0.029    no
AchMotPOSSI       <19    -0.059    0.027    yes   0.095            negligible     0.082             negligible
AchMotPOSSI       19-24   0.036    0.029    no
AchMotPOSSI       >24     0.023    0.04     no

Task Leadership
TskLdWORK         <19     0.264    0.016    yes   0.324            negligible     0.468             intermediate
TskLdWORK         19-24  -0.06     0.018    yes   0.324            negligible     0.144             negligible
TskLdWORK         >24    -0.204    0.025    yes   0.468            intermediate   0.144             negligible
TskLdLEADER       <19    -0.127    0.017    yes   0.163            negligible     0.218             negligible
TskLdLEADER       19-24   0.036    0.018    yes   0.163            negligible     0.055             negligible
TskLdLEADER       >24     0.091    0.025    yes   0.218            negligible     0.055             negligible
TskLdMOTIVA       <19    -0.137    0.023    yes   0.16             negligible     0.25              negligible
TskLdMOTIVA       19-24   0.023    0.026    no
TskLdMOTIVA       >24     0.113    0.035    yes   0.25             negligible     0.09              negligible

Active Initiative
ActIniBUSY        <19     0.074    0.019    yes   0.044            negligible     0.178             negligible
ActIniBUSY        19-24   0.03     0.021    no
ActIniBUSY        >24    -0.104    0.028    yes   0.178            negligible     0.134             negligible
ActIniENRG        <19    -0.028    0.019    no
ActIniENRG        19-24  -0.046    0.02     yes   0.018            negligible     0.12              negligible
ActIniENRG        >24     0.074    0.028    yes   0.102            negligible     0.12              negligible
ActIniGETT        <19    -0.046    0.027    no
ActIniGETT        19-24   0.016    0.029    no
ActIniGETT        >24     0.03     0.039    no

Self Confidence
SelfConABI        <19     0.013    0.017    no
SelfConABI        19-24  -0.063    0.019    yes   0.076            negligible     0.113             negligible
SelfConABI        >24     0.05     0.026    no
SelfConAPP        <19     0.091    0.018    yes   0.063            negligible     0.21              negligible
SelfConAPP        19-24   0.028    0.02     no
SelfConAPP        >24    -0.119    0.027    yes   0.21             negligible     0.147             negligible
SelfConBELI       <19    -0.104    0.025    yes   0.139            negligible     0.173             negligible
SelfConBELI       19-24   0.035    0.027    no
SelfConBELI       >24     0.069    0.037    no

*Effect size 1: for <19, <19 compared with 19-24; for 19-24, 19-24 compared with <19; for >24, >24 compared with 19-24.
**Effect size 2: for <19, <19 compared with >24; for 19-24, 19-24 compared with >24; for >24, >24 compared with <19.
***Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES < .638; large = ES ≥ .638.
One of the primary issues with DIF for items in the LEQ subdomains is that there are
only three items per subdomain. Removing one item out of three will likely affect internal reliability of the TL subscale (EAP is 0.82, which corresponds with Cronbach’s alpha) and standard errors, and, potentially, content coverage. The Task Leadership item showing DIF is the most difficult of the three, but only marginally, so it does not contribute much more to the person-‐item distribution than the other items in the subscale. Considering that this particular item functions differently than the other items for two subgroups and that programs are not likely to analyze or use this item level for targeted programming purposes, the item should probably be replaced. The current version of the TL subscale was derived from an 8-‐item subscale (from the LEQ-‐G), so other items from within the longer Task Leadership scale could readily be considered for inclusion in an updated version of the subscale.
Chapter 5: Discussion
Contemporary, professional standards for validity, reliability and fairness can be combined with the technical advantages of modern measurement models in order to appraise validity arguments for the interpretation and use of test scores. Statistical evidence for model fit must be balanced with substantive analyses of data in order to make appropriate evaluations about what test scores can be interpreted to mean and the appropriateness of score use. Wright maps are an important form of output from the Rasch models, and were used to examine the probabilistic relationship between items and responses along the shared scale representing the Life Effectiveness construct. Wright maps produced empirical evidence, in the form of technically calibrated responses to items, which was used to test hypotheses about test content and test structure. This particular form of hypothesis testing focused on the interpretable meaning of scores and not just their statistical properties. This investigation of validity evidence assumed that the developmental process of creating the LEQ was complete, and that the primary concern of the various stakeholders who are using the LEQ is the interpretation of scores. This study therefore constituted an appraisal stage of validation, in which “a neutral or even critical stance” was taken to explore “hidden assumptions and…alternative possible interpretations of the test scores” (Kane, 2006, p.26).
In this study, Rasch item response modeling was applied to examine validity evidence based on test content and internal structure of the LEQ. Likelihood ratio tests and weighted mean square statistics, as well as the practical effect sizes (i.e., applied use) of these statistics, were used to determine the best-fitting model for the LEQ data given the composite, consecutive, and multidimensional approaches. Chi-square likelihood ratio tests showed that the partial credit model fit better than the rating scale model for the composite and multidimensional models (p<0.05), and for seven of the eight consecutive models (p<0.05). Likelihood ratio tests also showed that the multidimensional model fit better than the composite or consecutive models. However, weighted mean square statistics for items and steps fit well for all three approaches. This is a notable finding, as it is often the case that items created for an instrument using a classical test theory approach and confirmatory factor analysis do not fit item response models well (e.g., Hahn et al., 2010; Egberink & Meijer, 2011). Another noteworthy finding was that no meaningful DIF was identified in either the composite LEQ or in any of the consecutive LEQ subscales. This indicates that there is no statistical evidence of construct-irrelevant variance by gender, age, or voluntary status. The fit of data to the Rasch model and the finding of no DIF provided good initial evidence for validity based on the internal structure of the LEQ.
The partial-‐credit, multidimensional model is likely the best approach to modeling life effectiveness because it offers the benefits of eight correlated estimates without losing information from the distinct outcome areas. Nonetheless, substantive analyses of LEQ data in this investigation focused on the appraisal of evidence for the composite and consecutive approaches, because these are the approaches that are currently applied programmatically. The interpretation and use of scores from the LEQ is tightly linked to the particular approach that is used to produce the scores. The composite approach allows for easy comparison of one global Life Effectiveness outcome among programs. However, the composite scale also obscures potentially important program differences and provides no interpretable dimensional
information. The consecutive approach theoretically allows programs to pick and choose which constructs to measure, thereby reducing the testing response burden for participants and programs and better targeting evaluation efforts. However, each of the short consecutive scales must be able to stand on its own in terms of validity evidence based on test content and internal structure.
Validity based on test content was assessed by comparing the hypothesized blueprint for the LEQ with the empirical evidence as presented in Wright maps. The relative distribution of persons to items as shown in Wright maps of LEQ composite and subscale data indicated that LEQ items represent low to moderately low amounts of Life Effectiveness and were easy for respondents to positively endorse. However, the blueprint for LEQ item content was hypothesized to represent optimal, high levels of competency relative to each LEQ subdomain. This finding indicates a mismatch between the intended meaning of item content and empirical item difficulties, so that the test developers’ initial hypothesis about item content is not well supported by the data.
Construct mapping allows a test developer to theorize about content relevance and representativeness for a construct, and to then test this theory through item response modeling. Without an a priori hypothesis about substantive ordering, as is the case with the LEQ, we are precluded from using the Wright map to compare the hypothesized sequencing of items with empirical sequencing. Nonetheless, the empirical evidence provided by the Wright map indicates that there is no order to LEQ test content—whether or not the lack of order was intentional. The lack of ordering of test content raises questions as to the adequacy with which items represent the content domain.
Furthermore, Wright maps provide evidence that LEQ content represents a homogeneous level of life effectiveness, consistent with the homogeneity implicitly hypothesized in the LEQ blueprint, but the level represented is low rather than optimal. However, if a latent variable is theorized to exist along a linear continuum (e.g., high to low, more to less) and derived scores are expressed in a corresponding continuum, then theoretically one would expect items to be constituted along a continuum as well. Evidence from the Wright map indicates that items oversample from one relatively easy level of the Life Effectiveness continuum and do not adequately represent the multiple levels one would expect to see for a Life Effectiveness construct. This finding is the same for the consecutively analyzed subscales of the LEQ.
When data do not align as intended or if content of the test otherwise “fails to capture important aspects of the construct”, this indicates a potential threat to validity through construct underrepresentation (AERA et al., 1999, p. 10; Messick, 1989). Construct underrepresentation in the LEQ appears to result in a “narrowed meaning of test scores", thereby weakening the argument for the use of the LEQ and the LEQ subscales.
Analyses of internal structure provided evidence about the degree to which “relationships among test items and test components conform[ed] to the construct on which the proposed test score interpretations are based” (AERA et al., 1999, p. 13). Evidence from Wright maps indicated that respondents did not interpret and differentiate between response categories equivalently within items or across subdomains. This indicated that the assumptions of interval, Likert scaling that are the basis of summed and averaged scores for the LEQ are not supported. The large quantity of numbered, unlabeled response categories did not lend themselves to meaningful or consistent interpretation, as evidenced by disordered average
measure values, which were likely due to the very low frequency counts in the lowest response categories. Over-categorization appears to have occurred with the LEQ, suggesting an overestimation of reliability coefficients and an underestimation of standard errors.
Wright maps provided evidence of good respondent-‐to-‐item threshold distribution for the composite LEQ. However, item thresholds are skewed to the lower levels of life effectiveness, so that standard errors are higher for respondents with higher ability estimates. The item threshold coverage was adequate for the subscales, but because of the small number of items for each scale (and despite the large number of response categories) there is evidence of gaps and sparse item threshold coverage for respondents with high life effectiveness estimates. Despite the reasonable reliability coefficients for the subscales, there was considerable uncertainty and standard errors associated with the person ability estimates for all subscales, due to the limited number of item thresholds contributing to estimations of respondents’ locations. The Standards (1999) suggest that if interpretation of scores is based on small subsets of items, users of such subtests should be provided with “guidance to enable them to judge the degree of confidence warranted” and “score reports should discourage overinterpretation of information that may be subject to considerable error” (AERA et al., 1999, p. 19). The sizable standard errors and large confidence intervals associated with the LEQ subscales, in particular, as well as for high ability respondents in the composite LEQ, caution us about being over-‐confident in using student ability estimates to make claims about change.
Returning to the design principles of the LEQ, perhaps the most important question to be answered with respect to the validity evidence is whether the meaning of the scores is interpretable enough to be used for evaluating program outcomes and for facilitating personal reflection. The Rasch analyses in this appraisal validity study reveal that the generalized and homogeneous structures of the overall Life Effectiveness construct and each of the eight subdomains provide weak evidence in support of instrument content. What evidence does exist gives very coarsely grained information about the theoretical construct, about respondents, and about the meaning of test scores and how those scores should be interpreted, particularly as change scores. That is, neither the individual response categories nor the total Life Effectiveness sum of these scores reveals much about the types of life competencies that respondents possess. We do not know with any degree of certainty what a person can or cannot do in terms of social competence, time management, intellectual flexibility, and so on.
In practical terms, even if problems with content validity or internal structure are ignored or trivialized, this lack of interpretability means that overall scores cannot tell programs or participants what specific construct-related skills or behaviors respondents already do well, and what skills or behaviors need improvement. For instance, if a group of students has a moderately high Time Management score at Time 1, say 5.5, and improves at Time 2 by one point, to 6.5, this may indicate a statistically significant change of moderate effect size. But what do these statistics tell us about what students substantively learned on the course? How do the statistics inform teacher and program practices? The LEQ instrument offers little information.
In sharing his original data (Neill, 2008) that made this study possible, Neill encouraged analyses and reporting of findings that would improve the quality and development of instruments that are used to measure change in experiential education programs (personal communication, 2007). Although there is nothing inherently wrong with the approach taken in constructing the LEQ, the Rasch analysis in this paper highlights some conceptual and technical limitations of that approach. Perhaps the biggest strength of the LEQ is the theoretical identification of subdomains that contribute to the Life Effectiveness construct. This theoretical work is likely a good starting point for new instrument development; consider, as an example, the subdomain of social competence.
Social competence represents one of many so-called "new constructs," which are based on personality, attitudes, and beliefs rather than on cognitive abilities, and which are theorized to be important mediators of achievement in school and the workplace (Kyllonen, 2005; Cunha & Heckman, 2010). These new constructs are clearly of interest not just to experiential education programs, but also to institutions of higher education, the military, and employers. Although economists' and industry's interests in new constructs may be predicated on financial motives, experiential education programs are more often inspired by frameworks of social justice and positive youth development (Sammet, 2010; Sibthorp & Morgan, 2011).
In order to gain traction in experiential education programs, new measures must add value if they are to replace traditional self-report instruments that are so easy to use, despite questionable validity (Hurst, personal communication, 2012). Although experiential education programs are as likely as profit-driven institutions to be interested in the predictive value of new constructs, education programs are also interested in measures of new constructs that can be used to improve learning and skills, and to provide immediate evidence of program effectiveness. Measures of new constructs, such as social competence, that are useful for purposes of assessing learning and for accountability in experiential education should be based on three central attributes: (1) a developmental perspective that incorporates learning progressions; (2) a close connection between the curriculum and the assessment system; and (3) high technical qualities of validity, reliability, and fairness (Pellegrino et al., 2001; Wilson, 2005a). These attributes integrate recommendations from the National Research Council on good assessment (Pellegrino et al., 2001) and practices from the Berkeley Evaluation and Assessment Research (BEAR) Assessment System (BAS; Wilson, 2005; Wilson & Sloane, 2000). New measures must also do a better job of capturing outcomes in experiential programs that have often been identified using qualitative, but not quantitative, research methods.
The initial step in designing a new social competence measure for experiential education programs is creating a map of relevant learning progressions, or developmental levels of the construct, that are grounded in the appropriate literature. It is essential that items in a measure link to all levels of social competence development that are identified in this social competence construct map. These learning progressions can then be elaborated in the instrument with examples of observable behaviors that might be exhibited by a participant in an experiential education program. Response processes should be examined to provide evidence for how respondents substantively interpret item content and use the response options, and this evidence should inform ongoing instrument development.
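As an illustration only, a construct map can be represented as an ordered set of levels, each elaborated with observable behaviors. The level names and example behaviors below are hypothetical placeholders, not a validated progression grounded in the social competence literature (cf. Rose-Krasnor, 1997):

```python
# Minimal sketch of a construct map for social competence.
# Level names and example behaviors are hypothetical illustrations,
# not an actual, literature-grounded learning progression.
from collections import OrderedDict

social_competence_map = OrderedDict([
    (1, {"level": "Pre-social awareness",
         "example": "Works alone; rarely acknowledges group members."}),
    (2, {"level": "Responsive participation",
         "example": "Responds when addressed; follows group decisions."}),
    (3, {"level": "Active cooperation",
         "example": "Volunteers for shared tasks; offers help to peers."}),
    (4, {"level": "Perspective-taking",
         "example": "Restates others' viewpoints during disagreements."}),
    (5, {"level": "Facilitative leadership",
         "example": "Draws quieter members in; mediates conflicts."}),
])

# Each item written for the instrument should target one of these levels,
# so that empirical item difficulties can later be checked against the
# intended ordering of the map.
for score, band in social_competence_map.items():
    print(score, band["level"], "-", band["example"])
```

The value of such a structure is that the hypothesized ordering becomes an explicit, testable claim rather than an implicit assumption of the response format.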
Second, measures of social competence can be integrated into the curriculum and ongoing program practices. Measures can move beyond artificial testing events that ask participants to self-report on their actions or abilities. Instead, practitioners can be trained to observe and collect evidence of participant behaviors and learning (e.g., using quotes, drawings, writings, notations of actions) that occur during regular program activities. Practitioners can be further trained to reflect on their observations and to make nuanced evaluations of participants, based on sequential, developmental landmarks of what participants can do and what they know about social competence. These observational evaluations by program staff can take place in addition to, or in place of, self-reports. Because observational assessments can be made during regular program activities, outcomes of the assessments can be used immediately to inform ongoing instruction. Post-intervention, results of individual assessments can be aggregated to support continuous program improvement.
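A minimal sketch of how such observational ratings might be recorded and aggregated follows; the participants, ratings, and five-level scale are hypothetical and assume the kind of construct map sketched above:

```python
# Minimal sketch: aggregating practitioners' observational ratings against
# ordered developmental levels. Names, levels, and ratings are hypothetical.
from statistics import median

observations = {
    "participant_a": [2, 3, 3, 4],  # ratings on a 5-level map across activities
    "participant_b": [1, 2, 2, 2],
    "participant_c": [4, 4, 5, 4],
}

# Individual summaries can inform instruction during the program.
for who, ratings in observations.items():
    print(f"{who}: median level = {median(ratings)}")

# Post-intervention, individual summaries can be pooled for program reporting.
program_levels = [median(r) for r in observations.values()]
print("program median level:", median(program_levels))
```

The median is used here only as a simple, outlier-resistant summary; in practice, ratings of this kind would be calibrated with a measurement model rather than averaged directly.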
Finally, the technical quality of social competence measures can be evaluated by applying modern measurement models and by considering the sources of evidence for validity, reliability, and fairness that are established by the Standards (AERA, APA & NCME, 1999). Generalized item response models facilitate testing of hypotheses about the theorized structure of a construct, and they provide useful information supporting inferences from the data that cannot be made through numerical score averages and other traditional summary statistical procedures.
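For example, the rating scale and partial credit models compared earlier in this study are nested, so their relative fit can be tested with a likelihood-ratio chi-square on the difference in deviance. The following is a minimal Python sketch with hypothetical deviance and parameter counts; in practice these values come from estimation software such as ConQuest (Wu, Adams & Wilson, 1998):

```python
# Minimal sketch: likelihood-ratio test comparing a rating scale model (RSM)
# against a partial credit model (PCM). Deviances and parameter counts are
# hypothetical; in practice they are read from estimation software output.
from scipy.stats import chi2

deviance_rsm, params_rsm = 104_230.0, 31   # constrained model (shared steps)
deviance_pcm, params_pcm = 103_905.0, 73   # model with item-specific steps

lr = deviance_rsm - deviance_pcm           # difference in -2 log-likelihood
df = params_pcm - params_rsm
p = chi2.sf(lr, df)                        # upper-tail chi-square probability
print(f"LR = {lr:.1f}, df = {df}, p = {p:.3g}")  # a small p favors the PCM
```

Because the two models differ only in whether step parameters are constrained to be equal across items, the test directly addresses a substantive hypothesis: whether respondents use the response categories equivalently across items.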
An example of this type of measure can be found in the Desired Results Developmental Profile School Age (2011) Complete Version (DRDP-‐SA; CDE, 2011a), which was constructed based on the principles of the BEAR (Berkeley Evaluation and Assessment Research) Assessment System (BAS; Wilson & Sloane, 2000; Wilson, 2005). The DRDP-‐SA is a strengths-‐based measure created to assess the positive cognitive, socio-‐emotional, language and physical development of youth who participate in before-‐ and after-‐school programs funded by the California Department of Education (CDE). The DRDP-‐SA is designed to allow for flexibility in the structure and objectives of individual youth development programs, while remaining sensitive to the economic, linguistic, and cultural diversity of youth and families that such programs serve (CDE, 2011b). Application of the BAS principles, as exemplified by the DRDP-‐SA, provides evidence of how assessments that are both developmentally appropriate and of high technical quality can be developed to support instructional and accountability purposes in youth development interventions (Sammet, Moore & Wilson, 2012).
References
Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
Adams, R.J., Wu, M. L. & Wilson, M. (2012). The Rasch Rating Model and the Disordered Threshold Controversy. Educational and Psychological Measurement. First published on February 17, 2012, doi:10.1177/0013164411432166.
Allen, D. D. (2007). Validity and reliability of the Movement Ability Measure: A self-‐report instrument proposed for assessing movement across diagnoses and ability levels. Physical Therapy, 87(7), 899-‐916.
Allen, D. D. (2010). Using item response modeling methods to test theory related to human performance. Journal of Applied Measurement, 11(2), 99-‐111.
Allen, D. D., & Wilson, M. (2006). Introducing multidimensional item response modeling in health behavior and health education research. Health Education Research, 21(Suppl 1), i73-‐84.
American Educational Research Association, American Psychological Association & National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
American Institutes for Research. (2005). Effects of adventure education programs for children in California. Palo Alto, CA: American Institutes for Research.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-‐573.
Australia Outward Bound (2010). Information retrieved from http://www.outwardbound.org.au/about-us/research/12-leq-life-effectiveness-questionnaire.html.
Bagley, A. M., Gorton, G. E., Bjornson, K., et al. (2011). Factor- and item-level analysis of the 38-item Activities Scale for Kids-Performance. Developmental Medicine & Child Neurology, 53, 161-166.
Bandy, T., Burkhauser, M., & Metz, A. (2009). Data-driven decision making in out-of-school time programs. Child Trends, Publication #2009-34. Accessed from http://www.childtrends.org/Files/Child_Trends-2009_06_23_RB_Decision-Support.pdf.
Beach, P. A. (1997). A statistical analysis of the validity and usefulness of the leisure deficits scale in differentiating inpatient psychiatric patients from a nonclinical sample in three key areas of leisure functioning. Dissertation Abstracts International: Section B: The Sciences and Engineering, 58(3-‐B), pp.1588.
Behrens, T. R., & Kelly, T. (2008). Paying the piper: Foundation evaluation capacity calls the tune. In J. G. Carman & K. A. Fredericks (Eds.), Nonprofits and evaluation. New Directions for Evaluation, 119, 35-70.
Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology. American Psychologist, 61(1), 27-41.
Bond, T. V., & Fox, C. M. (2007). Applying the Rasch Model: Fundamental Measurement in the Human Sciences. New York: Taylor & Francis Group.
Bornstein, R. F. (2011). Toward a process-focused model of test score validity: Improving psychological assessment in science and practice. Psychological Assessment, 23(2), 532-544.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425-440.
Borsboom, D. (2008). Latent variable theory. Measurement, 6, 25-53.
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061-1071.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87-100.
Brosi, E. (2011). Measurement Tools for Evaluating Out-‐of-‐School Time Programs: An Evaluation Resource. Harvard Family Research Project. Accessed November 1, 2011 from http://www.hfrp.org/out-‐of-‐school-‐time/publications-‐resources/measurement-‐tools-‐for-‐evaluating-‐out-‐of-‐school-‐time-‐programs-‐an-‐evaluation-‐resource2.
Brown, N. J. S., & Wilson, M. (2011). A model of cognition: The missing cornerstone of assessment. Educational Psychology Review, 23(2), 221-‐234.
California Department of Education (2011a). Desired Results Developmental Profile-‐School Age (2011) Complete Version. Retrieved from: http://www.cde.ca.gov/sp/cd/ci/drdpforms.asp
California Department of Education (2011b). Introduction to Desired Results. Retrieved from: http://www.cde.ca.gov/sp/cd/ci/desiredresults.asp.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent myths and urban legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3(3), 106-116.
Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68(3), 397-‐412.
Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31-43.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-‐44.
Combs, A. W. (1999). Being and becoming: A field approach to psychology. New York: Springer. As cited in Neill, J. (2008). Enhancing Life Effectiveness: The Impacts of Adventure Education Programs. Unpublished doctoral thesis, University of Western Sydney.
Cronbach, L. J. (1971). Test validation. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 443-‐507). Washington, DC: American Council on Education.
Cronbach, L.J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test Validity (pp. 3-‐17). Hillsdale, NJ: Erlbaum.
Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Cunha, F. & Heckman, J.J. (2010). Investing in our young people. National Bureau of Economic Research Working Paper 16201. Retrieved from http://www.nber.org/papers/w16201.
DeVellis, R. F. (2003). Scale Development: Theory and Applications (2nd ed.). Thousand Oaks, CA: Sage.
DeVellis, R. F. (2006). Classical test theory. Medical Care, 44(11, Suppl 3), S50-S59.
DeVellis, R. F. (2011). Scale Development: Theory and Applications (3rd ed.). Thousand Oaks, CA: Sage.
Egberink, I. J. L., & Meijer, R. R. (2011). An item response theory analysis of Harter's Self-Perception Profile for Children or why strong clinical scales should be distrusted. Assessment, 18(2), 201-212.
Eid, M., & Zickar, M. J. (2007). Detecting response styles and faking in personality and organizational assessments by mixed Rasch models. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 255-270). New York: Springer Science + Business Media.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349.
Fixsen, D. L., Naoom, S. F., Blasé, K., Friedman, R. M., & Wallace, F. (2005). Implementation research: A synthesis of the literature. National Implementation Research Network, University of South Florida, Louis de la Parte Florida Mental Health Institute. Available online at http://nirn.fmhi.usf.edu/resources/.
Gass, M. A. (2005). Comprehending the value structures influencing significance and power behind experiential education research. Journal of Experiential Education, 27(3), 286-‐296.
Gnambs, T., & Batinic, B. (2011). Evaluation of measurement precision with Rasch-‐type models: The case of the short generalized opinion leadership scale. Personality and Individual Differences, 50(1), 53-‐58.
Gray-Little, B., Williams, V. S., & Hancock, T. D. (1997). An item response theory analysis of the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 23, 443-451.
Goodwin, L. D., & Leech, N. L. (2003). The meaning of validity in the new standards for educational and psychological testing: Implications for measurement courses. Measurement and Evaluation in Counseling and Development, 36(3), 181-‐192.
Government Performance and Results Act (GPRA) (1993).
Guion, R. (1980). On Trinitarian doctrines of validity. Professional Psychology, 11(3), 385-398.
Hahn, E. A., DeVellis, R. F., Bode, R. K., Garcia, S. F., Castel, L. D., Eisen, S. V., . . . The PROMIS Cooperative Group. (2010). Measuring social health in the Patient-Reported Outcomes Measurement Information System (PROMIS): Item bank development and testing. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care & Rehabilitation, 19(7), 1035-1044.
Hattie, J., Marsh, H. W., Neill, J. T., & Richards, G. E. (1997). Adventure education and Outward Bound: Out-of-class experiences that make a lasting difference. Review of Educational Research, 67(1), 43-87.
Heesch, K. C., Mâsse, L. C., & Dunn, A. L. (2006). Using Rasch modeling to re-‐evaluate three scales related to physical activity: Enjoyment, perceived benefits and perceived barriers. Health Education Research, 21(Suppl 1), 58-‐72.
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64(5), 802-‐812.
Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Hood, S. B. (2009). Validity in psychological testing and scientific realism. Theory & Psychology, 19, 451-473.
Johnstone, C. J., Thompson, S. J., Bottsford-Miller, N. A., & Thurlow, M. L. (2008). Universal design and multimethod approaches to item review. Educational Measurement: Issues and Practice, 27, 25-36.
Joint Committee on Standards for Educational Evaluation (1994). The Program Evaluation Standards. Thousand Oaks, CA: Sage.
Kahn, J., Bronte-Tinkew, B. A., & Theokas, C. (2008). How can I assess the quality of my program? Tools for out-of-school time program practitioners (Research-to-Results Brief). Washington, DC: Child Trends. Accessed from http://www.childtrends.org/files/child_trends-2008_02_19_eval8programquality.pdf.
Kane, M. (1992). An argument-‐based approach to validity. Psychological Bulletin, 112(3), 527-‐535.
Kane, M. (2001). Current Concerns in Validity Theory. Journal of Educational Measurement, 38(4), 319-‐342.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-‐64). Westport, CT: American Council on Education and Praeger.
Kane, M. (2009). Validating the interpretations and uses of test scores. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions and applications (pp. 39-‐64). Greenwich, CT, US: IAP Information Age Publishing.
Kane, M. (2010). Errors of measurement, theory, and public policy. William H. Angoff memorial lecture series. Educational Testing Service. Princeton, NJ.
Karelitz, T.M., Parrish, D. M., Yamada, H., Wilson, M. (2010). Articulating assessments across childhood: The cross-‐age validity of the Desired Results Developmental Profile-‐Revised. Educational Assessment, 15, 1-‐26.
Kyllonen, P. C. (2005). The case for noncognitive assessments. R&D Connections. Princeton, NJ: ETS.
Linacre, J. M. (1999). Category disordering (disordered categories) vs. threshold disordering (disordered thresholds). Rasch Measurement Transactions, 13(1), 675.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106.
Linacre, J. M. (2012). Winsteps Rasch Tutorial 2. Retrieved from http://www.winsteps.com/a/winsteps-tutorial-2.pdf.
Lissitz, R. W. (2009). Introduction. In R. W. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions and Applications (pp. 1-15). Greenwich, CT: IAP Information Age Publishing.
Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36, 437-448.
Longford, N. T., Holland, P. W., & Thayer, D. T. (1993). Stability of the MH D-DIF statistics across populations. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 171-196). Hillsdale, NJ: Lawrence Erlbaum.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Pt. 9).
Lucas, S. R. (2008). Theorizing Discrimination in an Era of Contested Prejudice: Discrimination in the United States. Philadelphia, PA: Temple University Press.
Marsh, H.W., Richards, G. E., & Barnes, J. (1984). Multidimensional self-‐concepts: The effect of participation in an Outward Bound program. Journal of Personality and Social Psychology, 50(1), 195-‐204.
Marsh, H. W., & Richards, G. E. (1990). Self-other agreement and self-other differences on multidimensional self-concept ratings. Australian Journal of Psychology, 42(1), 31-45.
Mâsse, L.C., Dassa, C., Gauvin, L., et al. (2002) Emerging measurement and statistical methods in physical activity research. American Journal of Preventative Medicine, 23, 44-‐55.
Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.
Meakins, C. R. H., Bundy, A. C., & Gliner, J. (2005). Validity and reliability of The Experience of Leisure Scales (TELS). In F. F. McMahon, D. E. Lytle, & B. Sutton-Smith (Eds.), Play: An interdisciplinary synthesis (pp. 255-278). Lanham, MD: University Press of America.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education and National Council on Measurement in Education.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-‐749.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133-161.
Mislevy, R.J. (2009) Validity from the perspective of model-‐based reasoning. In R.W. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions and Applications (pp. 83-‐108). Charlotte, NC: Information Age Publishing.
Morizot, J. M., Ainsworth, A. T., & Reise, S. P. (2007). Towards modern psychometrics: Application of item response theory models in personality research. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of Research Methods in Personality Psychology (pp. 407-423). New York, NY: Guilford Press.
National Research Council. (2002). Community Programs to Promote Youth Development. Washington, DC: The National Academies Press.
Neill, J. (2007). Analyzing Your Life Effectiveness Questionnaire Data. Accessed from: http://wilderdom.com/tools/leq/LEQAnalyzing.html.
Neill, J. (2008). Enhancing Life Effectiveness: The Impacts of Adventure Education Programs. Unpublished doctoral thesis, University of Western Sydney.
Neill, J. (2008b). Why Use Effect Sizes Instead of Significance Testing in Program Evaluation. Accessed from: http://wilderdom.com/research/effectsizes.html.
Neill, J. T., Marsh, H. W., & Richards, G. E. (2003). The Life Effectiveness Questionnaire: Development and psychometrics. Unpublished manuscript, University of Western Sydney, Sydney, NSW, Australia.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Paek, I. (2002). Investigation of differential item functioning: Comparisons among approaches, and extension to a multidimensional context. Unpublished doctoral dissertation, University of California, Berkeley.
Paek, I., & Wilson, M. (2011). Formulating the Rasch differential item functioning model under the marginal maximum likelihood estimation context and its comparison with Mantel-‐Haenszel procedure in short test and small sample conditions. Educational and Psychological Measurement, 71(6), 1023-‐1046.
Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academies Press.
Popham, J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.
Purdie, N., Neill, J. T., & Richards, G. E. (2002). Australian identity and the effect of an outdoor education program. Australian Journal of Psychology, 54(1), 32-39.
Rasch, G. (1960, 1980). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research.
Raudenbush, S. W. (2005). Learning from attempts to improve schooling: The contribution of methodological diversity. Educational Researcher, 34(5), 25-‐31.
Reise, S. P. & Moore, T. M. (2012). An introduction to item response theory models and their application in the assessment of noncognitive traits. In H. Cooper (Ed.-‐in-‐Chief) et al., APA
Handbook of Research Methods in Psychology: Vol. 1. Foundations, Planning, Measures, and Psychometrics (pp. 699-‐721). Washington, D.C.: American Psychological Association.
Rose-‐Krasnor, L (1997). The nature of social competence: A theoretical review. Social Development, 6(1), 111-‐135.
Sammet, K. (2010). Relationships matter: Adolescent girls and relational development in adventure education. Journal of Experiential Education, 33(2), 151-‐165.
Sammet, K., Moore, S., & Wilson, M. (2012). Measuring Positive Development of Youth in Context: The Design and Validation of an Embedded Assessment System for Out-‐of-‐School Time Programs. American Educational Research Association 2013 Paper Presentation Proposal.
Santelices, M.V. & Wilson, M. (2010). Unfair treatment? The case of Freedle, the SAT and the standardization approach to differential item functioning. Harvard Educational Review 80(1), 106-‐134.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Sibthorp, J. (2000). Measuring weather... and adventure education: Exploring the instruments of adventure education research. Journal of Experiential Education, 23(2), 99-107.
Sibthorp, J., & Arthur-Banning, S. (2004). Developing life effectiveness through adventure education: The roles of participant expectations, perceptions of empowerment, and learning relevance. Journal of Experiential Education, 27(1), 32-50.
Sibthorp, J., & Morgan, C. (2011). Adventure-based programming: Exemplary youth development practice. New Directions for Youth Development, 2011, 105-119.
Sireci, S. G. (1998a). The construct of content validity. Social Indicators Research, 45, 83-117.
Sireci, S.G. (1998b) Gathering and analyzing content validity data. Educational Assessment, 5(4), 299-‐321.
Sireci, S. G., & Parker, P. (2006). Validity on trial: Psychometric and legal conceptualizations of validity. Educational Measurement: Issues and Practice, 25(3), 27-‐34.
Slaney, K. L., Tkatchouk, M., Gabriel, S., & Maraun, M. (2009). Psychometric assessment and reporting practices. Journal of Psychoeducational Assessment, 27(6), 465-476.
Spector, P.E. (1992). Summated Rating Scale Construction: An Introduction (Sage University Paper series on Quantitative Applications in the Social Sciences, No. 82). Newbury Park, CA: Sage.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. Cited in Neill, J. (2008). Enhancing Life Effectiveness: The Impacts of Adventure Education Programs. Unpublished doctoral thesis, University of Western Sydney.
Stone, M.H., Wright, B.D., Stenner, A. J. (1999). Mapping variables. Journal of Outcome Measurement 3(4), 308-‐322.
Sudman, S., Bradburn, N., & Schwartz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. San Francisco: Jossey-‐Bass.
Tenenbaum, G., Strauss, B., & Büsch, D. (2007). Applications of generalized Rasch models in the sport, exercise, and the motor domains. In M. von Davier, C. H. Carstensen, M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications. (pp. 347-‐356). New York: Springer Science + Business Media.
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. New York: Cambridge University Press.
United Way of America. (1996). Measuring program outcomes: A practical approach. Alexandria, VA: United Way of America.
van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-‐201.
W.K. Kellogg Foundation. (1998). W.K. Kellogg Foundation Evaluation Handbook. Battle Creek, MI: W.K. Kellogg Foundation.
Walker, K.E., Farley, C., Polin, M. (2012). Using Data in Multi-‐Agency Collaborations: Guiding Performance to Ensure Accountability and Improve Programs. Childtrends. Accessed from: http://www.childtrends.org/Files//Child_Trends-‐2012_02_23_FR_UsingData.pdf
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427-450.
Wang, C. K. J., Neill, J. T., Liu, W. C., Tan, O. S., Koh, C., & Ee, J. (2008). Project work and life skills: Psychometric properties of the Life Effectiveness Questionnaire for Project Work. Educational Research Journal, 21(1), 23-41.
Watson, K., Baranowski, T., & Thompson, D. (2006). Item response modeling: An evaluation of the Children's Fruit and Vegetable Self-Efficacy Questionnaire. Health Education Research, 21(Suppl 1), i47-i57.
Waugh, R. F. (2001). Measuring ideal and real self-concept on the same scale, based on a multifaceted, hierarchical model of self-concept. Educational and Psychological Measurement, 61(1), 85-101.
Weng, L-‐J. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-‐retest reliability. Educational and Psychological Measurement, 64(6), 956-‐972.
Wilson, M. (2003). On choosing a model for measuring. Methods of Psychological Research -‐ Online, 8(3), 1-‐22. Accessed from: http://www.dgps.de/fachgruppen/methoden/mpr-‐online/issue21/mpr122_8.pdf.
Wilson, M. (2005a). Constructing Measures: An Item Response Modeling Approach. Mahwah, NJ: Erlbaum.
Wilson, M. (2005b). Subscales and summary scales: Issues in health-‐related outcomes. In J. Lipscomb, C. C. Gotay & C. Snyder (Eds.), Outcomes Assessment in Cancer: Measures, Methods, and Applications, (pp. 465-‐479). New York: Cambridge University Press.
Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: Comparison with the classical test theory approach. Health Education Research, 21(Suppl 1), 19-‐32.
Wilson, M. & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-‐208.
Wright, B., & Masters, G. (1982). Rating Scale Analysis. Chicago: MESA Press.
Wright, B. D., & Linacre, J. M. (1992). Combining and splitting of categories. Rasch Measurement Transactions, 8(3), 370.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48. Retrieved from www.rasch.org/memo46.htm.
Wu, M., & Adams, R. (2007). Applying the Rasch model to psycho-social measurement: A practical approach. Melbourne: Educational Measurement Solutions.
Wu, M. L., Adams, R. J., & Wilson, M. (1998). ACER ConQuest: Generalised Item Response Modelling Software [computer program]. Hawthorn, Australia: ACER.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2: Generalized item response modeling software. Camberwell: Australian Council for Educational Research.
Yohalem, N., & Wilson-Ahlstrom, A., with Fischer, S., & Shinn, M. (2007, March). Measuring Youth Program Quality: A Guide to Assessment Tools. Washington, DC: The Forum for Youth Investment, Impact Strategies, Inc.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2004). DIF detection and interpretation in large-‐scale science assessments: Informing item writing practices. Educational Assessment, 9(1-‐2), 61-‐78.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Active Initiative items; each 'X' represents 48.7 cases.]
Active Initiative reliability coefficients: MLE person separation reliability = 0.783; WLE person separation reliability = 0.793; EAP/PV reliability = 0.845.
Figure B1. Wright Map of Active Initiative by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Active Initiative; each 'X' represents 48.7 cases.]
Figure B2. Wright Map of Active Initiative by Item Threshold.
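As an orientation aid for reading the threshold maps in this appendix, the step parameters underlying these displays are those of the partial credit model (Masters, 1982), under which the probability that person $n$ with ability $\theta_n$ responds in category $k$ of item $i$ (with categories $0, 1, \ldots, M_i$) is

$$P(X_{ni} = k \mid \theta_n) \;=\; \frac{\exp \sum_{j=0}^{k} (\theta_n - \delta_{ij})}{\sum_{m=0}^{M_i} \exp \sum_{j=0}^{m} (\theta_n - \delta_{ij})}, \qquad k = 0, 1, \ldots, M_i,$$

where the sum for $k = 0$ is defined to be zero. A label such as "2.5" in a threshold map marks the estimated location, on the same logit scale as the person estimates, of the fifth threshold of item 2.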
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Achievement Motivation items; each 'X' represents 58.7 cases.]
Achievement Motivation reliability coefficients: MLE person separation reliability = 0.739; WLE person separation reliability = 0.745; EAP/PV reliability = 0.795.
Figure B3. Wright Map of Achievement Motivation by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Achievement Motivation; each 'X' represents 58.7 cases.]
Figure B4. Wright Map of Achievement Motivation by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Emotional Control items; each 'X' represents 49.2 cases.]
Emotional Control reliability coefficients: MLE person separation reliability = 0.843; WLE person separation reliability = 0.846; EAP/PV reliability = 0.859.
Figure B5. Wright Map of Emotional Control by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Emotional Control; each 'X' represents 49.2 cases.]
Figure B6. Wright Map of Emotional Control by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Intellectual Flexibility items; each 'X' represents 57.3 cases.]
Intellectual Flexibility reliability coefficients: MLE person separation reliability = 0.732; WLE person separation reliability = 0.734; EAP/PV reliability = 0.769.
Figure B7. Wright Map of Intellectual Flexibility by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Intellectual Flexibility; each 'X' represents 57.3 cases.]
Figure B8. Wright Map of Intellectual Flexibility by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Self Confidence items; each 'X' represents 49.1 cases.]
Self Confidence reliability coefficients: MLE person separation reliability = 0.762; WLE person separation reliability = 0.768; EAP/PV reliability = 0.824.
Figure B9. Wright Map of Self Confidence by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Self Confidence; each 'X' represents 49.1 cases.]
Figure B10. Wright Map of Self Confidence by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Social Competence items; each 'X' represents 51.6 cases.]
Social Competence reliability coefficients: MLE person separation reliability = 0.840; WLE person separation reliability = 0.840; EAP/PV reliability = 0.851.
Figure B11. Wright Map of Social Competence by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Social Competence; each 'X' represents 51.6 cases.]
Figure B12. Wright Map of Social Competence by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Task Leadership items; each 'X' represents 44.6 cases.]
Task Leadership reliability coefficients: MLE person separation reliability = 0.810; WLE person separation reliability = 0.810; EAP/PV reliability = 0.824.
Figure B13. Wright Map of Task Leadership by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Task Leadership; each 'X' represents 44.6 cases.]
Figure B14. Wright Map of Task Leadership by Threshold.
[Wright map: WLE person ability estimates plotted against item difficulty estimates for the three Time Management items; each 'X' represents 43.2 cases.]
Time Management reliability coefficients: MLE person separation reliability = 0.829; WLE person separation reliability = 0.827; EAP/PV reliability = 0.839.
Figure B15. Wright Map of Time Management by Item.
[Wright map: WLE person ability estimates plotted against generalized item thresholds (labeled item.step) for Time Management; each 'X' represents 43.2 cases.]
Figure B16. Wright Map of Time Management by Threshold.
Table C1
LEQ Composite - Logit Estimates and 95% Confidence Intervals by Raw Score
(Each row lists: Raw Score, Estimate, Std. Err., 1.96 x Std. Err., 95% CI low in logits, 95% CI high in logits, Logit spread.)
Composite LEQ 0 -‐6.29238 1.42177 2.78667 -‐9.08 -‐3.51 5.58 1 -‐5.18314 0.82074 1.60865 -‐6.79 -‐3.57 3.22 2 -‐4.66550 0.63446 1.24354 -‐5.91 -‐3.42 2.49 3 -‐4.32520 0.53449 1.04760 -‐5.37 -‐3.28 2.10 4 -‐4.07242 0.46961 0.92044 -‐4.99 -‐3.15 1.85 5 -‐3.87210 0.42291 0.82890 -‐4.70 -‐3.04 1.66 6 -‐3.70682 0.38738 0.75926 -‐4.47 -‐2.95 1.52 7 -‐3.56657 0.35904 0.70372 -‐4.27 -‐2.86 1.41 8 -‐3.44508 0.33581 0.65819 -‐4.10 -‐2.79 1.32 9 -‐3.33813 0.31633 0.62001 -‐3.96 -‐2.72 1.25 10 -‐3.24276 0.29970 0.58741 -‐3.83 -‐2.66 1.18 11 -‐3.15681 0.28551 0.55960 -‐3.72 -‐2.60 1.12 12 -‐3.07864 0.27300 0.53508 -‐3.61 -‐2.54 1.08 13 -‐3.00700 0.26198 0.51348 -‐3.52 -‐2.49 1.03 14 -‐2.94091 0.25218 0.49427 -‐3.44 -‐2.45 0.99 15 -‐2.87957 0.24341 0.47708 -‐3.36 -‐2.40 0.96 16 -‐2.82234 0.23552 0.46162 -‐3.28 -‐2.36 0.93 17 -‐2.76870 0.22838 0.44762 -‐3.22 -‐2.32 0.90 18 -‐2.71822 0.22178 0.43469 -‐3.15 -‐2.28 0.87 19 -‐2.67051 0.21592 0.42320 -‐3.09 -‐2.25 0.85 20 -‐2.62528 0.21053 0.41264 -‐3.04 -‐2.21 0.83 21 -‐2.58227 0.20557 0.40292 -‐2.99 -‐2.18 0.81 22 -‐2.54123 0.20098 0.39392 -‐2.94 -‐2.15 0.79 23 -‐2.50198 0.19674 0.38561 -‐2.89 -‐2.12 0.78 24 -‐2.46436 0.19278 0.37785 -‐2.84 -‐2.09 0.76 25 -‐2.42819 0.18917 0.37077 -‐2.80 -‐2.06 0.75 26 -‐2.39336 0.18579 0.36415 -‐2.76 -‐2.03 0.73 27 -‐2.35975 0.18264 0.35797 -‐2.72 -‐2.00 0.72 28 -‐2.32726 0.17969 0.35219 -‐2.68 -‐1.98 0.71 29 -‐2.29580 0.17695 0.34682 -‐2.64 -‐1.95 0.70 30 -‐2.26527 0.17438 0.34178 -‐2.61 -‐1.92 0.69 31 -‐2.23562 0.17197 0.33706 -‐2.57 -‐1.90 0.68
32 -‐2.20676 0.16972 0.33265 -‐2.54 -‐1.87 0.67 33 -‐2.17865 0.16760 0.32850 -‐2.51 -‐1.85 0.66 34 -‐2.15122 0.16562 0.32462 -‐2.48 -‐1.83 0.65 35 -‐2.12443 0.16375 0.32095 -‐2.45 -‐1.80 0.65 36 -‐2.09823 0.16200 0.31752 -‐2.42 -‐1.78 0.64 37 -‐2.07258 0.16036 0.31431 -‐2.39 -‐1.76 0.63 38 -‐2.04744 0.15881 0.31127 -‐2.36 -‐1.74 0.63 39 -‐2.02277 0.15735 0.30841 -‐2.33 -‐1.71 0.62 40 -‐1.99855 0.15599 0.30574 -‐2.30 -‐1.69 0.62 41 -‐1.97474 0.15470 0.30321 -‐2.28 -‐1.67 0.61 42 -‐1.95130 0.15350 0.30086 -‐2.25 -‐1.65 0.61 43 -‐1.92823 0.15237 0.29865 -‐2.23 -‐1.63 0.60 44 -‐1.90549 0.15131 0.29657 -‐2.20 -‐1.61 0.60 45 -‐1.88305 0.15031 0.29461 -‐2.18 -‐1.59 0.59 46 -‐1.86090 0.14938 0.29278 -‐2.15 -‐1.57 0.59 47 -‐1.83902 0.14851 0.29108 -‐2.13 -‐1.55 0.59 48 -‐1.81739 0.14770 0.28949 -‐2.11 -‐1.53 0.58 49 -‐1.79598 0.14694 0.28800 -‐2.08 -‐1.51 0.58 50 -‐1.77480 0.14621 0.28657 -‐2.06 -‐1.49 0.58 51 -‐1.75380 0.14557 0.28532 -‐2.04 -‐1.47 0.58 52 -‐1.73298 0.14497 0.28414 -‐2.02 -‐1.45 0.57 53 -‐1.71232 0.14443 0.28308 -‐2.00 -‐1.43 0.57 54 -‐1.69181 0.14392 0.28208 -‐1.97 -‐1.41 0.57 55 -‐1.67144 0.14346 0.28118 -‐1.95 -‐1.39 0.57 56 -‐1.65120 0.14305 0.28038 -‐1.93 -‐1.37 0.57 57 -‐1.63106 0.14267 0.27963 -‐1.91 -‐1.35 0.56 58 -‐1.61103 0.14234 0.27899 -‐1.89 -‐1.33 0.56 59 -‐1.59108 0.14204 0.27840 -‐1.87 -‐1.31 0.56 60 -‐1.57120 0.14179 0.27791 -‐1.85 -‐1.29 0.56 61 -‐1.55139 0.14157 0.27748 -‐1.83 -‐1.27 0.56 62 -‐1.53164 0.14139 0.27712 -‐1.81 -‐1.25 0.56 63 -‐1.51192 0.14125 0.27685 -‐1.79 -‐1.24 0.56 64 -‐1.49224 0.14114 0.27663 -‐1.77 -‐1.22 0.56 65 -‐1.47258 0.14107 0.27650 -‐1.75 -‐1.20 0.56 66 -‐1.45294 0.14103 0.27642 -‐1.73 -‐1.18 0.56
67 -‐1.43329 0.14103 0.27642 -‐1.71 -‐1.16 0.56 68 -‐1.41364 0.14106 0.27648 -‐1.69 -‐1.14 0.56 69 -‐1.39397 0.14113 0.27661 -‐1.67 -‐1.12 0.56 70 -‐1.37428 0.14123 0.27681 -‐1.65 -‐1.10 0.56 71 -‐1.35455 0.14136 0.27707 -‐1.63 -‐1.08 0.56 72 -‐1.33477 0.14152 0.27738 -‐1.61 -‐1.06 0.56 73 -‐1.31494 0.14172 0.27777 -‐1.59 -‐1.04 0.56 74 -‐1.29504 0.14193 0.27818 -‐1.57 -‐1.02 0.56 75 -‐1.27507 0.14219 0.27869 -‐1.55 -‐1.00 0.56 76 -‐1.25502 0.14249 0.27928 -‐1.53 -‐0.98 0.56 77 -‐1.23488 0.14281 0.27991 -‐1.51 -‐0.95 0.56 78 -‐1.21464 0.14317 0.28061 -‐1.50 -‐0.93 0.57 79 -‐1.19429 0.14356 0.28138 -‐1.48 -‐0.91 0.57 80 -‐1.17381 0.14398 0.28220 -‐1.46 -‐0.89 0.57 81 -‐1.15321 0.14443 0.28308 -‐1.44 -‐0.87 0.57 82 -‐1.13247 0.14492 0.28404 -‐1.42 -‐0.85 0.57 83 -‐1.11157 0.14543 0.28504 -‐1.40 -‐0.83 0.58 84 -‐1.09052 0.14598 0.28612 -‐1.38 -‐0.80 0.58 85 -‐1.06930 0.14655 0.28724 -‐1.36 -‐0.78 0.58 86 -‐1.04790 0.14716 0.28843 -‐1.34 -‐0.76 0.58 87 -‐1.02631 0.14781 0.28971 -‐1.32 -‐0.74 0.58 88 -‐1.00452 0.14848 0.29102 -‐1.30 -‐0.71 0.59 89 -‐0.98251 0.14918 0.29239 -‐1.27 -‐0.69 0.59 90 -‐0.96029 0.14992 0.29384 -‐1.25 -‐0.67 0.59 91 -‐0.93783 0.15069 0.29535 -‐1.23 -‐0.64 0.60 92 -‐0.91513 0.15149 0.29692 -‐1.21 -‐0.62 0.60 93 -‐0.89218 0.15233 0.29857 -‐1.19 -‐0.59 0.60 94 -‐0.86896 0.15320 0.30027 -‐1.17 -‐0.57 0.61 95 -‐0.84546 0.15410 0.30204 -‐1.15 -‐0.54 0.61 96 -‐0.82167 0.15503 0.30386 -‐1.13 -‐0.52 0.61 97 -‐0.79758 0.15599 0.30574 -‐1.10 -‐0.49 0.62 98 -‐0.77318 0.15699 0.30770 -‐1.08 -‐0.47 0.62 99 -‐0.74844 0.15802 0.30972 -‐1.06 -‐0.44 0.62 100 -‐0.72337 0.15908 0.31180 -‐1.04 -‐0.41 0.63 101 -‐0.69795 0.16017 0.31393 -‐1.01 -‐0.38 0.63
102 -‐0.67217 0.16130 0.31615 -‐0.99 -‐0.36 0.64 103 -‐0.64600 0.16246 0.31842 -‐0.96 -‐0.33 0.64 104 -‐0.61945 0.16365 0.32075 -‐0.94 -‐0.30 0.65 105 -‐0.59249 0.16487 0.32315 -‐0.92 -‐0.27 0.65 106 -‐0.56512 0.16612 0.32560 -‐0.89 -‐0.24 0.66 107 -‐0.53732 0.16741 0.32812 -‐0.87 -‐0.21 0.66 108 -‐0.50907 0.16872 0.33069 -‐0.84 -‐0.18 0.67 109 -‐0.48036 0.17007 0.33334 -‐0.81 -‐0.15 0.67 110 -‐0.45118 0.17141 0.33596 -‐0.79 -‐0.12 0.68 111 -‐0.42152 0.17282 0.33873 -‐0.76 -‐0.08 0.68 112 -‐0.39136 0.17427 0.34157 -‐0.73 -‐0.05 0.69 113 -‐0.36068 0.17574 0.34445 -‐0.71 -‐0.02 0.69 114 -‐0.32948 0.17724 0.34739 -‐0.68 0.02 0.70 115 -‐0.29773 0.17877 0.35039 -‐0.65 0.05 0.71 116 -‐0.26543 0.18032 0.35343 -‐0.62 0.09 0.71 117 -‐0.23255 0.18191 0.35654 -‐0.59 0.12 0.72 118 -‐0.19909 0.18352 0.35970 -‐0.56 0.16 0.72 119 -‐0.16503 0.18517 0.36293 -‐0.53 0.20 0.73 120 -‐0.13035 0.18684 0.36621 -‐0.50 0.24 0.74 121 -‐0.09504 0.18854 0.36954 -‐0.46 0.27 0.74 122 -‐0.05909 0.19024 0.37287 -‐0.43 0.31 0.75 123 -‐0.02246 0.19203 0.37638 -‐0.40 0.35 0.76 124 0.01485 0.19383 0.37991 -‐0.37 0.39 0.76 125 0.05285 0.19563 0.38343 -‐0.33 0.44 0.77 126 0.09158 0.19753 0.38716 -‐0.30 0.48 0.78 127 0.13105 0.19944 0.39090 -‐0.26 0.52 0.79 128 0.17129 0.20138 0.39470 -‐0.22 0.57 0.79 129 0.21231 0.20338 0.39862 -‐0.19 0.61 0.80 130 0.25414 0.20541 0.40260 -‐0.15 0.66 0.81 131 0.29681 0.20750 0.40670 -‐0.11 0.70 0.82 132 0.34035 0.20964 0.41089 -‐0.07 0.75 0.83 133 0.38479 0.21184 0.41521 -‐0.03 0.80 0.84 134 0.43016 0.21411 0.41966 0.01 0.85 0.84 135 0.47651 0.21645 0.42424 0.05 0.90 0.85 136 0.52388 0.21887 0.42899 0.09 0.95 0.86
137 0.57231 0.22143 0.43400 0.14 1.01 0.87 138 0.62186 0.22404 0.43912 0.18 1.06 0.88 139 0.67259 0.22677 0.44447 0.23 1.12 0.89 140 0.72456 0.22962 0.45006 0.27 1.17 0.91 141 0.77784 0.23261 0.45592 0.32 1.23 0.92 142 0.83252 0.23574 0.46205 0.37 1.29 0.93 143 0.88869 0.23905 0.46854 0.42 1.36 0.94 144 0.94645 0.24255 0.47540 0.47 1.42 0.96 145 1.00593 0.24626 0.48267 0.52 1.49 0.97 146 1.06726 0.25021 0.49041 0.58 1.56 0.99 147 1.13059 0.25444 0.49870 0.63 1.63 1.00 148 1.19611 0.25898 0.50760 0.69 1.70 1.02 149 1.26403 0.26387 0.51719 0.75 1.78 1.04 150 1.33459 0.26917 0.52757 0.81 1.86 1.06 151 1.40806 0.27494 0.53888 0.87 1.95 1.08 152 1.48480 0.28125 0.55125 0.93 2.04 1.11 153 1.56519 0.28819 0.56485 1.00 2.13 1.13 154 1.64974 0.29586 0.57989 1.07 2.23 1.16 155 1.73901 0.30454 0.59690 1.14 2.34 1.20 156 1.83375 0.31423 0.61589 1.22 2.45 1.24 157 1.93487 0.32523 0.63745 1.30 2.57 1.28 158 2.04353 0.33787 0.66223 1.38 2.71 1.33 159 2.16126 0.35257 0.69104 1.47 2.85 1.39 160 2.29007 0.36994 0.72508 1.56 3.02 1.46 161 2.43278 0.39084 0.76605 1.67 3.20 1.54 162 2.59337 0.41659 0.81652 1.78 3.41 1.64 163 2.77786 0.44926 0.88055 1.90 3.66 1.77 164 2.99591 0.49295 0.96618 2.03 3.96 1.94 165 3.26458 0.55457 1.08696 2.18 4.35 2.18 166 3.61836 0.65094 1.27584 2.34 4.89 2.56 167 4.14647 0.83364 1.63393 2.51 5.78 3.27 168 5.26236 1.43223 2.80717 2.46 8.07 5.62
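The derived columns in Tables C1 and C2 follow directly from each estimate and its standard error. As a check, for a composite raw score of 1:

$$\hat{\theta} \pm 1.96 \times SE \;=\; -5.18314 \pm 1.96 \times 0.82074 \;=\; -5.18314 \pm 1.60865,$$

which gives the reported interval of $[-6.79, -3.57]$ and a logit spread of $2 \times 1.60865 \approx 3.22$.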
Table C2
LEQ Subdomains - Logit Estimates and 95% Confidence Intervals by Raw Score
(Each row lists: Raw Score, Estimate, Std. Err., 1.96 x Std. Err., 95% CI low in logits, 95% CI high in logits, Logit spread.)
Achievement Motivation
0 -‐6.44349 1.24974 2.44949 -‐8.89 -‐3.99 4.90 1 -‐5.57730 0.76308 1.49564 -‐7.07 -‐4.08 3.00 2 -‐5.16052 0.62367 1.22239 -‐6.38 -‐3.94 2.45 3 -‐4.85731 0.55887 1.09539 -‐5.95 -‐3.76 2.20 4 -‐4.59886 0.52544 1.02986 -‐5.63 -‐3.57 2.06 5 -‐4.35829 0.50885 0.99735 -‐5.36 -‐3.36 2.00 6 -‐4.12083 0.50293 0.98574 -‐5.11 -‐3.14 1.98 7 -‐3.87873 0.50374 0.98733 -‐4.87 -‐2.89 1.98 8 -‐3.62913 0.50873 0.99711 -‐4.63 -‐2.63 2.00 9 -‐3.37200 0.51677 1.01287 -‐4.38 -‐2.36 2.03 10 -‐3.10657 0.52854 1.03594 -‐4.14 -‐2.07 2.08 11 -‐2.82819 0.54573 1.06963 -‐3.90 -‐1.76 2.14 12 -‐2.52748 0.57074 1.11865 -‐3.65 -‐1.41 2.24 13 -‐2.18936 0.60588 1.18752 -‐3.38 -‐1.00 2.38 14 -‐1.79263 0.65197 1.27786 -‐3.07 -‐0.51 2.56 15 -‐1.31610 0.70511 1.38202 -‐2.70 0.07 2.77 16 -‐0.75927 0.75511 1.48002 -‐2.24 0.72 2.97 17 -‐0.14278 0.79827 1.56461 -‐1.71 1.42 3.13 18 0.52660 0.84356 1.65338 -‐1.13 2.18 3.31 19 1.25564 0.91097 1.78550 -‐0.53 3.04 3.58 20 2.10610 1.06361 2.08468 0.02 4.19 4.17 21 3.49038 1.66886 3.27097 0.22 6.76 6.55 Active Initiative
0 -7.20987 1.43454 2.81170 -10.02 -4.40 5.63 1 -6.11049 0.85503 1.67586 -7.79 -4.43 3.36 2 -5.57373 0.68902 1.35048 -6.92 -4.22 2.71 3 -5.19144 0.60959 1.19480 -6.39 -4.00 2.39 4 -4.87683 0.56581 1.10899 -5.99 -3.77 2.22 5 -4.59557 0.54181 1.06195 -5.66 -3.53 2.13 6 -4.32973 0.53064 1.04005 -5.37 -3.29 2.09 7 -4.06718 0.52918 1.03719 -5.10 -3.03 2.08 8 -3.79828 0.53574 1.05005 -4.85 -2.75 2.11
9 -‐3.51438 0.54921 1.07645 -‐4.59 -‐2.44 2.16 10 -‐3.20805 0.56851 1.11428 -‐4.32 -‐2.09 2.23 11 -‐2.87351 0.59262 1.16154 -‐4.04 -‐1.71 2.33 12 -‐2.50586 0.62105 1.21726 -‐3.72 -‐1.29 2.44 13 -‐2.09825 0.65377 1.28139 -‐3.38 -‐0.82 2.57 14 -‐1.64020 0.68973 1.35187 -‐2.99 -‐0.29 2.71 15 -‐1.12517 0.72405 1.41914 -‐2.54 0.29 2.84 16 -‐0.56660 0.75078 1.47153 -‐2.04 0.90 2.95 17 0.01330 0.77378 1.51661 -‐1.50 1.53 3.04 18 0.61465 0.80679 1.58131 -‐0.97 2.20 3.17 19 1.26528 0.87218 1.70947 -‐0.44 2.97 3.42 20 2.05013 1.02817 2.01521 0.03 4.07 4.04 21 3.38397 1.63167 3.19807 0.19 6.58 6.40 Emotional Control
0 -‐6.67711 1.42111 2.78538 -‐9.46 -‐3.89 5.58 1 -‐5.59338 0.84440 1.65502 -‐7.25 -‐3.94 3.32 2 -‐5.07298 0.68132 1.33539 -‐6.41 -‐3.74 2.68 3 -‐4.70394 0.60553 1.18684 -‐5.89 -‐3.52 2.38 4 -‐4.39756 0.56637 1.11009 -‐5.51 -‐3.29 2.23 5 -‐4.11791 0.54785 1.07379 -‐5.19 -‐3.04 2.15 6 -‐3.84502 0.54322 1.06471 -‐4.91 -‐2.78 2.13 7 -‐3.56462 0.54906 1.07616 -‐4.64 -‐2.49 2.16 8 -‐3.26612 0.56286 1.10321 -‐4.37 -‐2.16 2.21 9 -‐2.94241 0.58214 1.14099 -‐4.08 -‐1.80 2.29 10 -‐2.59064 0.60489 1.18558 -‐3.78 -‐1.41 2.38 11 -‐2.20930 0.63054 1.23586 -‐3.45 -‐0.97 2.48 12 -‐1.79396 0.65994 1.29348 -‐3.09 -‐0.50 2.59 13 -‐1.33555 0.69419 1.36061 -‐2.70 0.03 2.73 14 -‐0.82110 0.73268 1.43605 -‐2.26 0.61 2.88 15 -‐0.24202 0.77062 1.51042 -‐1.75 1.27 3.03 16 0.38792 0.80265 1.57319 -‐1.19 1.96 3.15 17 1.05563 0.83446 1.63554 -‐0.58 2.69 3.28 18 1.77749 0.87419 1.71341 0.06 3.49 3.43
108
Table C2-‐ Continued Subscale Raw
Score Estimate Std. Err 1.96 x
Std.Err 95%CI low logit range
95%CI high logit range
Logit spread
19 2.55552 0.93427 1.83117 0.72 4.39 3.67 20 3.43185 1.07840 2.11366 1.32 5.55 4.23 21 4.82219 1.67725 3.28741 1.53 8.11 6.58 Intellectual Flexibility
0 -‐6.12236 1.28282 2.51433 -‐8.64 -‐3.61 5.03 1 -‐5.22061 0.81144 1.59042 -‐6.81 -‐3.63 3.19 2 -‐4.72054 0.65222 1.27835 -‐6.00 -‐3.44 2.56 3 -‐4.37205 0.56889 1.11502 -‐5.49 -‐3.26 2.24 4 -‐4.10239 0.52182 1.02277 -‐5.13 -‐3.08 2.05 5 -‐3.87143 0.49573 0.97163 -‐4.84 -‐2.90 1.95 6 -‐3.65733 0.48341 0.94748 -‐4.60 -‐2.71 1.90 7 -‐3.44655 0.48145 0.94364 -‐4.39 -‐2.50 1.89 8 -‐3.22869 0.48804 0.95656 -‐4.19 -‐2.27 1.92 9 -‐2.99534 0.50194 0.98380 -‐3.98 -‐2.01 1.97 10 -‐2.73951 0.52214 1.02339 -‐3.76 -‐1.72 2.05 11 -‐2.45600 0.54763 1.07335 -‐3.53 -‐1.38 2.15 12 -‐2.13981 0.57830 1.13347 -‐3.27 -‐1.01 2.27 13 -‐1.78285 0.61473 1.20487 -‐2.99 -‐0.58 2.41 14 -‐1.37235 0.65678 1.28729 -‐2.66 -‐0.09 2.58 15 -‐0.89668 0.70097 1.37390 -‐2.27 0.48 2.75 16 -‐0.35956 0.74105 1.45246 -‐1.81 1.09 2.91 17 0.22270 0.77706 1.52304 -‐1.30 1.75 3.05 18 0.84672 0.81937 1.60597 -‐0.76 2.45 3.22 19 1.53165 0.88942 1.74326 -‐0.21 3.27 3.49 20 2.35291 1.04712 2.05236 0.30 4.41 4.11 21 3.72094 1.65469 3.24319 0.48 6.96 6.49 Self Confidence
0 -‐6.06524 1.25890 2.46744 -‐8.53 -‐3.60 4.94 1 -‐5.17982 0.75884 1.48733 -‐6.67 -‐3.69 2.98 2 -‐4.76245 0.61433 1.20409 -‐5.97 -‐3.56 2.41 3 -‐4.46795 0.54641 1.07096 -‐5.54 -‐3.40 2.15 4 -‐4.22433 0.51073 1.00103 -‐5.23 -‐3.22 2.01
109
Table C2-‐ Continued Subsclase Raw
Score Estimate Std. Err 1.96 x
Std.Err 95%CI low logit range
95%CI high logit range
Logit spread
5 -‐4.00278 0.49321 0.96669 -‐4.97 -‐3.04 1.94 6 -‐3.78681 0.48817 0.95681 -‐4.74 -‐2.83 1.92 7 -‐3.56500 0.49260 0.96550 -‐4.53 -‐2.60 1.94 8 -‐3.32811 0.50452 0.98886 -‐4.32 -‐2.34 1.98 9 -‐3.06930 0.52193 1.02298 -‐4.09 -‐2.05 2.05 10 -‐2.78545 0.54251 1.06332 -‐3.85 -‐1.72 2.13 11 -‐2.47658 0.56481 1.10703 -‐3.58 -‐1.37 2.22 12 -‐2.14219 0.58872 1.15389 -‐3.30 -‐0.99 2.31 13 -‐1.77845 0.61475 1.20491 -‐2.98 -‐0.57 2.41 14 -‐1.37931 0.64283 1.25995 -‐2.64 -‐0.12 2.52 15 -‐0.94117 0.67145 1.31604 -‐2.26 0.37 2.64 16 -‐0.46728 0.69968 1.37137 -‐1.84 0.90 2.75 17 0.03795 0.73143 1.43360 -‐1.40 1.47 2.87 18 0.58388 0.77740 1.52370 -‐0.94 2.11 3.05 19 1.20777 0.85857 1.68280 -‐0.48 2.89 3.37 20 2.00737 1.03236 2.02343 -‐0.02 4.03 4.05 21 3.39893 1.66054 3.25466 0.14 6.65 6.51 Social Competence
0 -‐6.94436 1.48250 2.90570 -‐9.85 -‐4.04 5.82 1 -‐5.78913 0.89350 1.75126 -‐7.54 -‐4.04 3.51 2 -‐5.20432 0.72651 1.42396 -‐6.63 -‐3.78 2.85 3 -‐4.77673 0.64822 1.27051 -‐6.05 -‐3.51 2.55 4 -‐4.41628 0.60669 1.18911 -‐5.61 -‐3.23 2.38 5 -‐4.08685 0.58516 1.14691 -‐5.23 -‐2.94 2.30 6 -‐3.76940 0.57646 1.12986 -‐4.90 -‐2.64 2.26 7 -‐3.45218 0.57697 1.13086 -‐4.58 -‐2.32 2.27 8 -‐3.12678 0.58479 1.14619 -‐4.27 -‐1.98 2.30 9 -‐2.78618 0.59918 1.17439 -‐3.96 -‐1.61 2.35 10 -‐2.42234 0.62016 1.21551 -‐3.64 -‐1.21 2.44 11 -‐2.02506 0.64771 1.26951 -‐3.29 -‐0.76 2.54 12 -‐1.58311 0.68095 1.33466 -‐2.92 -‐0.25 2.67 13 -‐1.08806 0.71740 1.40610 -‐2.49 0.32 2.82 14 -‐0.53628 0.75352 1.47690 -‐2.01 0.94 2.96 15 0.07033 0.78490 1.53840 -‐1.47 1.61 3.08 16 0.71623 0.80866 1.58497 -‐0.87 2.30 3.17
110
Table C2-‐ Continued Subscale Raw
Score Estimate Std. Err 1.96 x
Std.Err 95%CI low logit range
95%CI high logit range
Logit spread
17 1.38522 0.83044 1.62766 -‐0.24 3.01 3.26 18 2.08256 0.86124 1.68803 0.39 3.77 3.38 19 2.82481 0.92070 1.80457 1.02 4.63 3.61 20 3.68338 1.06964 2.09649 1.59 5.78 4.20 21 5.07421 1.67535 3.28369 1.79 8.36 6.57 Task Leadership
0 -‐5.75998 1.46723 2.87577 -‐8.64 -‐2.88 5.76 1 -‐4.60706 0.86177 1.68907 -‐6.30 -‐2.92 3.38 2 -‐4.05576 0.68504 1.34268 -‐5.40 -‐2.71 2.69 3 -‐3.67657 0.59908 1.17420 -‐4.85 -‐2.50 2.35 4 -‐3.37405 0.55078 1.07953 -‐4.45 -‐2.29 2.16 5 -‐3.11099 0.52278 1.02465 -‐4.14 -‐2.09 2.05 6 -‐2.86767 0.50838 0.99642 -‐3.86 -‐1.87 2.00 7 -‐2.63156 0.50416 0.98815 -‐3.62 -‐1.64 1.98 8 -‐2.39284 0.50858 0.99682 -‐3.39 -‐1.40 2.00 9 -‐2.14214 0.52098 1.02112 -‐3.16 -‐1.12 2.05 10 -‐1.86998 0.54099 1.06034 -‐2.93 -‐0.81 2.13 11 -‐1.56666 0.56799 1.11326 -‐2.68 -‐0.45 2.23 12 -‐1.22329 0.60062 1.17722 -‐2.40 -‐0.05 2.36 13 -‐0.83318 0.63654 1.24762 -‐2.08 0.41 2.50 14 -‐0.39295 0.67262 1.31834 -‐1.71 0.93 2.64 15 0.09566 0.70540 1.38258 -‐1.29 1.48 2.77 16 0.62303 0.73319 1.43705 -‐0.81 2.06 2.88 17 1.17791 0.76041 1.49040 -‐0.31 2.67 2.99 18 1.76332 0.79856 1.56518 0.20 3.33 3.14 19 2.40843 0.86948 1.70418 0.70 4.11 3.41 20 3.20124 1.03106 2.02088 1.18 5.22 4.05 21 4.55577 1.64281 3.21991 1.34 7.78 6.44 Time Management
0 -‐5.57978 1.45441 2.85064 -‐8.43 -‐2.73 5.71 1 -‐4.46281 0.88002 1.72484 -‐6.19 -‐2.74 3.45 2 -‐3.89351 0.71530 1.40199 -‐5.30 -‐2.49 2.81 3 -‐3.47601 0.63541 1.24540 -‐4.72 -‐2.23 2.50 4 -‐3.12780 0.59009 1.15658 -‐4.28 -‐1.97 2.32 Table C2-‐ Continued
111
Subscale Raw Score
Estimate Std. Err 1.96 x Std.Err
95%CI low logit range
95%CI high logit range
Logit spread
5 -‐2.81664 0.56341 1.10428 -‐3.92 -‐1.71 2.21 6 -‐2.52523 0.54880 1.07565 -‐3.60 -‐1.45 2.16 7 -‐2.24241 0.54292 1.06412 -‐3.31 -‐1.18 2.13 8 -‐1.95989 0.54400 1.06624 -‐3.03 -‐0.89 2.14 9 -‐1.67091 0.55114 1.08023 -‐2.75 -‐0.59 2.17 10 -‐1.36915 0.56398 1.10540 -‐2.47 -‐0.26 2.22 11 -‐1.04775 0.58253 1.14176 -‐2.19 0.09 2.29 12 -‐0.69857 0.60666 1.18905 -‐1.89 0.49 2.38 13 -‐0.31228 0.63574 1.24605 -‐1.56 0.93 2.50 14 0.11945 0.66714 1.30759 -‐1.19 1.43 2.62 15 0.59703 0.69601 1.36418 -‐0.77 1.96 2.73 16 1.10692 0.71953 1.41028 -‐0.30 2.52 2.83 17 1.63541 0.74323 1.45673 0.18 3.09 2.92 18 2.18936 0.78008 1.52896 0.66 3.72 3.06 19 2.80279 0.85144 1.66882 1.13 4.47 3.34 20 3.56571 1.01333 1.98613 1.58 5.55 3.98 21 4.89138 1.62071 3.17659 1.71 8.07 6.36
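Because the Rasch model assigns each raw subscale score a unique logit estimate, Table C2 can be used as a straight lookup when converting scores. A minimal sketch, assuming a hypothetical dictionary populated with three Achievement Motivation rows copied from the table:

    # Hypothetical raw-score-to-logit lookup built from Table C2 rows
    # (Achievement Motivation; (estimate, std_err) keyed by raw score).
    ACH_MOT = {
        14: (-1.79263, 0.65197),
        15: (-1.31610, 0.70511),
        16: (-0.75927, 0.75511),
    }

    def score_to_logit(raw_score, table=ACH_MOT, z=1.96):
        """Return (logit estimate, 95% CI low, 95% CI high) for a raw score."""
        est, se = table[raw_score]
        return est, est - z * se, est + z * se

    print(score_to_logit(15))
    # -> (-1.3161, -2.698..., 0.065...), cf. -2.70 and 0.07 in Table C2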
Table D1
LEQ (Composite) - DIF by Gender

Model comparison (LEQ by gender)
  Model 1: item-gender+item*step -- deviance 255436; 170 estimated parameters
  Model 2: item-gender+item*gender+item*step -- deviance 255004; 193 estimated parameters
  Chi-square test of Model 2 v. Model 1: change in deviance 432; df 23; significant at .05 (yes)
  Mean gender parameter estimates: Male 0.037 (SE 0.013); Female -0.037

Item-level results
Item Name     Item No.  Estimate  Error        Sig. (p < .05)  Effect Size  Magnitude*
TimeMgtPLAN   1         0.016     0.015        no
SocCompSUCC   2         0.083     0.015        yes             0.166        negligible
AchMotDETAI   3         0.033     0.016        yes             0.066        negligible
IntFlexCHAN   4         -0.011    0.016        no
TskLdWORK     5         -0.037    0.014        yes             0.074        negligible
EmotContST    6         -0.117    0.014        yes             0.234        negligible
ActIniBUSY    7         0.072     0.016        yes             0.144        negligible
SelfConABI    8         -0.045    0.015        yes             0.09         negligible
TimeMgtWAST   9         0.004     0.014        no
SocCompCOMP   10        0.087     0.016        yes             0.174        negligible
AchMotRESUL   11        0.008     0.017        no
IntFlexNEW    12        0.054     0.018        yes             0.108        negligible
TskLdLEADER   13        -0.001    0.015        no
EmotContNEW   14        -0.103    0.016        yes             0.206        negligible
ActIniENRG    15        0.002     0.015        no
SelfConAPP    16        -0.146    0.016        yes             0.292        negligible
TimeMgtMANA   17        0.03      0.015        yes             0.06         negligible
SlfCompCOMM   18        0.134     0.016        yes             0.268        negligible
AchMotPOSSI   19        0.05      0.017        yes             0.1          negligible
IntFlexADAP   20        0.026     0.018        no
TskLdMOTIVA   21        0.008     0.015        no
EmotContWRO   22        -0.088    0.015        yes             0.176        negligible
ActIniGETT    23        0.014     0.015        no
SelfConBELI   24        -0.073    constrained  likely          0.146        negligible

*Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES ≤ .638; large = ES > .638
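The effect sizes in Table D1 equal twice the absolute DIF estimate, which is consistent with the two gender parameters being constrained to sum to zero (so the gap between groups is double the reported estimate); magnitudes follow the cutoffs in the table note. A sketch under that assumption, with an illustrative function name:

    # Gender (or voluntary-status) DIF effect size and magnitude label,
    # assuming ES = 2 x |estimate| and the cutoffs from the table notes.
    def dif_magnitude(estimate):
        es = 2 * abs(estimate)
        if es < 0.426:
            return es, "negligible"
        if es <= 0.638:
            return es, "intermediate"
        return es, "large"

    print(dif_magnitude(0.083))   # SocCompSUCC -> (0.166, 'negligible')
    print(dif_magnitude(-0.117))  # EmotContST  -> (0.234, 'negligible')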
Table D2
LEQ (Composite) - DIF by Voluntary Status

Model comparison (LEQ-H)
  Model 1: item-voluntary+item*step -- deviance 243787; 170 estimated parameters
  Model 2: item-voluntary+item*voluntary+item*step -- deviance 243610; 193 estimated parameters
  Chi-square test of Model 2 v. Model 1: change in deviance 177; df 23; significant at .05 (yes)
  Mean voluntary-status parameter estimates: not voluntary 1.194; voluntary -1.194

Item-level results
Item Name     Item No.  Estimate  Error        Sig. (p < .05)  Effect Size  Magnitude*
TimeMgtPLAN   1         -0.049    0.022        yes             0.098        negligible
SocCompSUCC   2         -0.078    0.023        yes             0.156        negligible
AchMotDETAI   3         0.017     0.023        no
IntFlexCHAN   4         0.06      0.023        yes             0.12         negligible
TskLdWORK     5         0.194     0.021        yes             0.388        negligible
EmotContST    6         0.035     0.021        no
ActIniBUSY    7         -0.093    0.024        yes             0.186        negligible
SelfConABI    8         0.015     0.021        no
TimeMgtWAST   9         -0.033    0.021        no
SocCompCOMP   10        -0.127    0.024        yes             0.254        negligible
AchMotRESUL   11        -0.104    0.026        yes             0.208        negligible
IntFlexNEW    12        0.063     0.025        yes             0.126        negligible
TskLdLEADER   13        0.06      0.022        yes             0.12         negligible
EmotContNEW   14        0.075     0.022        yes             0.15         negligible
ActIniENRG    15        0.019     0.022        no
SelfConAPP    16        0.03      0.023        no
TimeMgtMANA   17        -0.067    0.022        yes             0.134        negligible
SlfCompCOMM   18        -0.127    0.024        yes             0.254        negligible
AchMotPOSSI   19        -0.02     0.025        no
IntFlexADAP   20        0.095     0.025        yes             0.19         negligible
TskLdMOTIVA   21        0.078     0.022        yes             0.156        negligible
EmotContWRO   22        0.09      0.022        yes             0.18         negligible
ActIniGETT    23        -0.017    0.022        no
SelfConBELI   24        -0.114    constrained  possible        0.228        negligible

*Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES ≤ .638; large = ES > .638
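The model-comparison rows of Tables D1 through D3 are likelihood-ratio tests: the drop in deviance between the nested models is referred to a chi-square distribution with degrees of freedom equal to the change in the number of estimated parameters. A minimal sketch using scipy, with the Table D2 figures as input:

    # Likelihood-ratio (chi-square) test for nested models.
    from scipy.stats import chi2

    def lr_test(dev_base, dev_full, k_base, k_full):
        change = dev_base - dev_full   # change in deviance
        df = k_full - k_base           # change in number of parameters
        return change, df, chi2.sf(change, df)

    # Table D2: deviances 243787 vs. 243610; 170 vs. 193 parameters
    print(lr_test(243787, 243610, 170, 193))
    # -> change = 177, df = 23, p far below .05, hence "yes" in the table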
Table D3
LEQ (Composite) - DIF by Age

Model comparison (LEQ-H)
  Model 1: item-agecat3+item*step -- deviance 256202; 171 estimated parameters
  Model 2: item-agecat3+item*agecat3+item*step -- deviance 255808; 217 estimated parameters
  Chi-square test of Model 2 v. Model 1: change in deviance 394; df 46; significant at .05 (yes)
  Mean age-category parameter estimates (SE): <19 = 0.029 (.004); 19 to 24 = 0.019 (.004); >24 = -0.047 (.006)

Item-level results
Item No.  Item Title    Category  Estimate  Error  Sig. (p < .05)  Effect Size 1*  Magnitude***  Effect Size 2**  Magnitude***
1   TimeMgtPLAN   <19    -0.02   0.013  no
1   TimeMgtPLAN   19-24  0.106   0.014  yes  0.126  negligible  0.192  negligible
1   TimeMgtPLAN   >24    -0.086  0.019  yes  0.066  negligible  0.192  negligible
2   SocCompSUCC   <19    -0.097  0.013  yes  0.102  negligible  0.189  negligible
2   SocCompSUCC   19-24  0.005   0.014  no
2   SocCompSUCC   >24    0.092   0.019  yes  0.189  negligible  0.087  negligible
3   AchMotDETAI   <19    0.069   0.013  yes  0.019  negligible  0.188  negligible
3   AchMotDETAI   19-24  0.05    0.015  yes  0.019  negligible  0.169  negligible
3   AchMotDETAI   >24    -0.119  0.02   yes  0.188  negligible  0.169  negligible
4   IntFlexCHAN   <19    -0.002  0.013  no
4   IntFlexCHAN   19-24  0.016   0.014  no
4   IntFlexCHAN   >24    -0.014  0.02   no
5   TskLdWORK     <19    0.224   0.013  yes  0.27   negligible  0.402  negligible
5   TskLdWORK     19-24  -0.046  0.014  yes  0.27   negligible  0.132  negligible
5   TskLdWORK     >24    -0.178  0.019  yes  0.402  negligible  0.132  negligible
6   EmotContST    <19    0.035   0.012  yes  0.099  negligible  0.005  negligible
6   EmotContST    19-24  -0.064  0.014  yes  0.099  negligible  0.094  negligible
6   EmotContST    >24    0.03    0.018  no
7   ActIniBUSY    <19    0.021   0.013  no
7   ActIniBUSY    19-24  -0.005  0.014  no
7   ActIniBUSY    >24    -0.017  0.019  no
8   SelfConABI    <19    -0.023  0.013  no
8   SelfConABI    19-24  -0.072  0.014  yes  0.049  negligible  0.167  negligible
8   SelfConABI    >24    0.095   0.019  yes  0.118  negligible  0.167  negligible
9   TimeMgtWAS    <19    0.015   0.012  no
9   TimeMgtWAS    19-24  0.083   0.013  yes  0.068  negligible  0.181  negligible
9   TimeMgtWAS    >24    -0.098  0.018  yes  0.113  negligible  0.181  negligible
10  SocCompCOMP   <19    -0.033  0.013  yes  0.015  negligible  0.085  negligible
10  SocCompCOMP   19-24  -0.018  0.014  no
10  SocCompCOMP   >24    0.052   0.019  yes  0.085  negligible  0.07   negligible
11  AchMotRESUL   <19    0.016   0.014  no
11  AchMotRESUL   19-24  0.042   0.015  yes  0.026  negligible  0.1    negligible
11  AchMotRESUL   >24    -0.058  0.02   yes  0.074  negligible  0.1    negligible
12  IntFlexNEW    <19    -0.008  0.014  no
12  IntFlexNEW    19-24  -0.016  0.015  no
12  IntFlexNEW    >24    0.025   0.021  no
13  TskLdLEADER   <19    0.004   0.013  no
13  TskLdLEADER   19-24  0.008   0.014  no
13  TskLdLEADER   >24    -0.012  0.019  no
14  EmotContNEW   <19    -0.006  0.013  no
14  EmotContNEW   19-24  -0.073  0.014  yes  0.067  negligible  0.152  negligible
14  EmotContNEW   >24    0.079   0.019  yes  0.085  negligible  0.152  negligible
15  ActIniENRG    <19    -0.026  0.013  yes  0.015  negligible  0.092  negligible
15  ActIniENRG    19-24  -0.041  0.014  yes  0.015  negligible  0.107  negligible
15  ActIniENRG    >24    0.066   0.019  yes  0.092  negligible  0.107  negligible
16  SelfConAPP    <19    0.019   0.013  no
16  SelfConAPP    19-24  -0.026  0.014  no
16  SelfConAPP    >24    0.007   0.019  no
17  TimeMgtMANA   <19    0       0.013  no
17  TimeMgtMANA   19-24  0.067   0.014  yes  0.067  negligible  0.134  negligible
17  TimeMgtMANA   >24    -0.067  0.019  yes  0.067  negligible  0.134  negligible
18  SlfCompCOMM   <19    -0.087  0.013  yes  0.123  negligible  0.138  negligible
18  SlfCompCOMM   19-24  0.036   0.014  yes  0.123  negligible  0.015  negligible
18  SlfCompCOMM   >24    0.051   0.019  yes  0.138  negligible  0.015  negligible
19  AchMotPOSSI   <19    -0.008  0.014  no
19  AchMotPOSSI   19-24  0.076   0.015  yes  0.084  negligible  0.144  negligible
19  AchMotPOSSI   >24    -0.068  0.02   yes  0.06   negligible  0.144  negligible
20  IntFlexADAP   <19    0.029   0.014  yes  0.073  negligible  0.013  negligible
20  IntFlexADAP   19-24  -0.044  0.015  yes  0.073  negligible  0.06   negligible
20  IntFlexADAP   >24    0.016   0.02   no
21  TskLdMOTIVA   <19    -0.004  0.013  no
21  TskLdMOTIVA   19-24  0.001   0.014  no
21  TskLdMOTIVA   >24    0.003   0.019  no
22  EmotContWRO   <19    0.002   0.013  no
22  EmotContWRO   19-24  -0.052  0.014  yes  0.054  negligible  0.102  negligible
22  EmotContWRO   >24    0.05    0.019  yes  0.048  negligible  0.102  negligible
23  ActIniGETT    <19    -0.037  0.013  yes  0.027  negligible  0.084  negligible
23  ActIniGETT    19-24  -0.01   0.014  no
23  ActIniGETT    >24    0.047   0.019  yes  0.084  negligible  0.057  negligible
24  SelfConBELI   <19    -0.083  0.062  no
24  SelfConBELI   19-24  -0.023  0.068  no
24  SelfConBELI   >24    0.105   0.092  no

*Effect size 1: for <19, comparison with 19-24; for 19-24, comparison with <19; for >24, comparison with 19-24.
**Effect size 2: for <19, comparison with >24; for 19-24, comparison with >24; for >24, comparison with <19.
***Magnitude: negligible = ES < .426; intermediate = .426 ≤ ES ≤ .638; large = ES > .638.
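The two effect-size columns in Table D3 appear to be absolute differences between pairs of age-category DIF estimates, classified against the same magnitude cutoffs as in Tables D1 and D2. A sketch under that reading, with illustrative names:

    # Pairwise age-group DIF effect size, assuming ES = |estimate_a - estimate_b|.
    def pairwise_es(est_a, est_b):
        es = abs(est_a - est_b)
        if es < 0.426:
            return es, "negligible"
        if es <= 0.638:
            return es, "intermediate"
        return es, "large"

    # Item 1 (TimeMgtPLAN): <19 = -0.02, 19-24 = 0.106, >24 = -0.086
    print(pairwise_es(0.106, -0.02))   # -> (0.126, 'negligible'); effect size 1 for 19-24
    print(pairwise_es(0.106, -0.086))  # -> (0.192, 'negligible'); effect size 2 for 19-24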
http://wilderdom.com/tools/leq/wiki/doku.php?id=irt_analysis#draft_text_for_permission_letter
irt_analysis: LEQ
Overview
Kara Sammet is considering an Item Response Theory (IRT) analysis of the LEQ. The hope is that this analysis will provide an even more rigorous and insightful view of the utility of the items, and suggest useful ways forward for developing the LEQ and for measuring change through experiential and outdoor education programs.
Draft Text for Permission Letter
To whom it may concern:
1. I agree to freely share all existing LEQ data which I have rights and access to with Kara Sammet in good faith for the purposes of her dissertation project and for the purposes of any subsequent publication, etc.
2. Intellectual work performed on the data is the property of the person who did the work, although I would encourage and support endeavours to use open licensing and the public domain.
3. I understand that analyses undertaken may not support the reliability and validity of the LEQ. I would encourage analysis and reporting to be as candid as possible, with a view to improving and developing the quality of available instrumentation.
4. I agree that Kara Sammet can freely publish and share the results of her dissertation and any derivative projects and publications.
Proposal
  Example: Marsh's Flow scale article → extending validity.
  Degree to which people can customise instruments: LEQ-H (24 items) and LEQ-G (64 items).