Psychometric Issues in the Measurement of Non-Cognitive Attributes
Yoon Soo Park
University of Illinois – College of Medicine at Chicago
October 6, 2014
Correspondence concerning this manuscript should be addressed to Yoon Soo Park, Department of Medical Education, University of Illinois – College of Medicine at Chicago, 808 S. Wood Street, 986 CMET (MC 591) Chicago, IL 60612-7309. Email: [email protected].
Abstract
Recent research has demonstrated the impact that non-cognitive attributes have on long-term life outcomes, with supporting evidence continuing to emerge across disciplines. Non-cognitive attributes refer to character skills such as conscientiousness, motivation, and agreeableness, in contrast with cognitive attributes, which traditionally reflect general knowledge or intelligence. Although investing in non-cognitive attributes has shown great promise, psychometric issues pertaining to their measurement deserve greater attention and discussion. Non-cognitive attributes are harder to measure because they rely on sampling behaviors, which requires sufficient cases, items, and raters and complicates reliable and precise estimation. This paper uses psychometric rater models to refine measurements of non-cognitive attributes. Empirical analysis using teacher observation data from classroom settings demonstrates the benefits of this technique. In particular, when estimates from the psychometric rater models were analyzed with value-added scores, non-cognitive attributes had greater effect sizes relative to traditional methods. This paper also proposes a new method for measuring non-cognitive attributes that accounts for modes of observation. Real-world data from police promotion exercises are used to demonstrate its use. Monte Carlo simulations show stability in the recovery of parameter estimates.
Key words: Non-cognitive attributes, hierarchical rater model, psychometrics, value-added models
1. Introduction
During the past decade, studies in disciplines ranging from economics and education to
medicine and public health have underscored the value of non-cognitive attributes that have
long-term effects in predicting meaningful outcomes (e.g., Heckman and Kautz, 2013; Gutman
and Schoon, 2013; Park, Riddle, and Tekian, 2014; Abramson, Park, Stehling-Ariza, and
Redlener, 2010). Non-cognitive attributes refer to character skills, such as conscientiousness,
motivation, agreeableness, sociability, or perseverance; they contrast with cognitive attributes
that measure general knowledge or intelligence. Studies have shown that investing in non-cognitive
attributes during early childhood and elementary school programs significantly
contributes to a person’s future potential and life outcomes (Heckman, Pinto, and Savelyev, 2013).
Evidence supporting the effects of non-cognitive attributes has been noted in various contexts,
including the labor market, schools, community, and hospitals, where these qualities affect the
success of workplace performance and productivity (Almlund, Duckworth, Heckman, and Kautz,
2011). As such, proper measurement of non-cognitive attributes that minimizes measurement
error and maximizes precision has become an important subject of discussion and a focus of this
paper.
In the field of professions education, where a highly specialized workforce is trained (e.g.,
physicians, teachers, lawyers), the value of non-cognitive attributes has received careful attention.
For example, in the training of medical doctors, competence in communication and interpersonal
skills (CIS) is considered an essential element of good clinical practice. CIS has been linked
with patient and physician satisfaction, better medical decisions, patient safety, and adherence to
treatment plans, which in turn bear on patient outcomes, quality of healthcare, cost-effective
practice, and decreases in malpractice (Makoul and Curry, 2007; Vincent, Young, and
Philips, 1994). Therefore, valid and reliable measurement of non-cognitive attributes for health
professional trainees is at the forefront of medical licensure and maintenance of quality in the
United States and Canada. For teachers, research has shown that effective teaching requires
competence in non-cognitive attributes that measure proper management of classroom
environment and demonstration of professional responsibilities (Danielson, 2007); these
attributes surpass the impact that professional credentials and graduate degrees provide. When
combined with student achievement gains, non-cognitive attributes of teachers have been shown to be
an effective mechanism for identifying qualified teachers (MET, 2012). For high-ranking police
officers and firefighters, the ability to be professional and communicate effectively has long been
noted as an important competency in their work and is included in their promotion assessments
(City of Columbus Civil Service Commission, 2012). As demonstrated in these examples,
emerging research has continuously emphasized the value of non-cognitive attributes as a
significant construct at the highest levels of specialized education and training across a variety of
fields.
While the impact of non-cognitive attributes on long-term outcomes has been recognized,
they have been largely neglected due to insufficient evidence supporting their reliability and
non-ignorable errors associated with their precision; these issues are known as reference bias or
measurement error (Heckman and Rubinstein, 2001; Groot, 2000). Instruments used to measure
non-cognitive attributes have been noted to be “subjective” (Jacob and Lefgren, 2008; Rockoff
and Speroni, 2010) and in many cases rely on self-reported surveys, which can affect their
response process and consequently their validity. When non-cognitive attributes are measured by
human raters, they face issues of rater severity or bias that can significantly affect the quality of
data (Park, Holtzman, and Chen, 2014). These issues contrast with instruments used to measure
cognitive attributes, such as achievement tests, which have fixed and unequivocal responses.
To rank order trainees by non-cognitive attributes for selection or for promotional
purposes, challenges posed by these measurement issues need to be examined. In this regard,
assessment researchers and testing organizations in professions education have relied on
observing samples of behavior to measure non-cognitive attributes, while closely monitoring
their psychometric characteristics by studying their properties. This paper focuses on
measurement of non-cognitive attributes that use samples of trainee behavior evaluated by
human raters and psychometric techniques to overcome this problem.
Assessing how learners react to certain scenarios and observing their behaviors in real-
life circumstances have shown promise as a reliable and valid mechanism to measure non-
cognitive attributes (Jackson, 2013; Pratt and Cullen, 2000). Measurement precision, particularly
for non-cognitive attributes, is therefore a sampling issue – observations rely on sufficient
sampling of cases or items and the use of trained raters to score and evaluate trainees. For
example, in measuring teaching effectiveness, principals observe a teacher’s instructional ability
and gather evidence to generate a numeric score of their performance. In the case of physicians
training to practice independently, they are observed and evaluated by senior medical personnel
until they are certified. Although a scoring rubric is often provided with ample opportunities for
training, the use of raters assumes that their judgments are invariant to extraneous
circumstances. Non-cognitive attributes are generally measured using performance-based
assessments, requiring human judgment to evaluate their quality. This presents a challenge, as
the consistency and accuracy of rater judgments can become a critical issue, especially when
high-stakes decisions and consequences are confounded with rater bias. Therefore, improving the
measurement of non-cognitive attributes remains an important and meaningful discussion.
However, to date, the literature lacks methodological guidance and techniques that can refine
measures of non-cognitive attributes.
The purpose of this paper is to incorporate psychometric rater models to refine
measurements of non-cognitive attributes. Psychometric rater models that rely on latent variables
are used as the empirical strategy in this context. The application of psychometric rater models
will demonstrate the need to apply model-based approaches that account for item- and rater-
specific effects to estimate the non-cognitive attributes. The latent class signal detection theory
(LC-SDT) model used in educational measurement and mathematical psychology will form the
theoretical basis for this approach. Extensions of the LC-SDT to hierarchical structures are
proposed with empirical application and Monte Carlo simulations to demonstrate its utility and
stability in estimates. Teacher observations of effective teaching qualities taken from Chicago
Public Schools (CPS) Recognizing Educators Advancing Chicago Students (REACH) project
and the Columbus Police and Firefighter data are used to demonstrate the method.
This paper is organized as follows. First, an overview of the LC-SDT model and its
psychometric background is provided as rationale for incorporating this method to measure
non-cognitive attributes. Then, the theoretical background for using hierarchical extensions of the
LC-SDT model that account for items, raters, and observational context is presented. The
ensuing sections use real-world data to present applications of the methods proposed. The CPS
teacher observation data are fit using the LC-SDT model to derive model-based non-cognitive
scores. Value-added scores are used to examine the utility of using psychometric rater models.
Finally, the Columbus Police and Firefighter data are used to fit hierarchical extensions of the
LC-SDT model with simulation studies that demonstrate the model’s identification, stability, and
estimation properties. Implications for future use and developments for measuring non-cognitive
attributes are provided, with potential policy implications.
2. Latent Class Signal Detection Theory (LC-SDT)
In the educational measurement and mathematical psychology literature, various models
have been developed to measure constructs that require human judgment. This paper applies one
such model to refine the measurement precision of non-cognitive attributes. The latent class
signal detection theory (LC-SDT) model (DeCarlo, 2002) is used as the empirical strategy for
analysis. A brief overview of the psychological background and parameterization of the model is
presented.
In the LC-SDT model, rating is conceptualized as a psychological process, where a
rater’s role is viewed as attempting to discriminate between latent classes of behaviors; the latent
classes are defined as ordinal performance categories from the scoring rubric. That is, for a non-
cognitive behavior with four performance categories, a rater’s task is to classify a specific
behavior from the trainee into one of the four latent scores. In fact, the role of a rater is to
discriminate between scores defined in the rubric, which is analogous to discriminating between
latent classes.
The latent class SDT model has two parameters that explain the response of a rater: (1)
discrimination (d) and (2) response criteria (ck). Rater discrimination (d) refers to the ability of a
rater to discriminate between latent classes of behaviors, and the response criteria (ck) represent
the internal criteria that the rater uses to compare and judge the behaviors. Figure 1 presents
a graphical representation of the SDT, where four probability distributions of perceptions in
behaviors are illustrated. There are three response criteria locations in the figure. These locations
represent a rater’s criteria for judging a particular score. For example, if a behavior is thought to
be between c1 and c2, then the rater gives the behavior a “2.” However, if a rater perceives the
quality as over c2, but below c3, then the score now becomes “3.” As such, the response criteria
represent a decisional aspect of the rater. Furthermore, it can be inferred from this diagram that
by shifting c3 up, the rater becomes stricter, because this decreases the likelihood of getting a “4.”
Likewise, by shifting c1 down, the rater becomes more lenient, because this increases the chance
for a rater to assign a higher score. As noted, these shifts in raters’ criteria locations represent
rater effects, because they allow a rater to be lenient or strict. Furthermore, the model can also account
for the shrinkage effect, in that if the criteria location for c1 is shifted to the far left, then a rater’s
chance of assigning a score of “1” becomes very low.
Figure 1. A Representation of SDT for Scoring Categories 1 to 4 (perceptual distributions located at 0, d, 2d, and 3d; response criteria c1, c2, and c3 separate the responses “1” to “4”)
The discrimination parameter (d) represents the distance between the probability
distributions and reflects a perceptual aspect of the rater. Rater discrimination represents how well
a rater discriminates between latent classes of behaviors. When the distance between
distributions is larger, the rater has greater discrimination between the latent classes, because this
means that the perceptions of each scoring category are more distinct. In other words, when d is
larger, there is less overlap between the distributions and less error in terms of a rater’s attempt
to classify a behavior. If the distance between distributions is small, the ability of a rater to
differentiate between two latent classes of behaviors becomes less clear.
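The roles of the response criteria and the discrimination parameter can be checked numerically. The sketch below assumes the cumulative logistic form Pr(Y ≤ k | η) = F(c_k − dη), latent classes coded η = 0, 1, 2, 3 (matching distributions located at 0, d, 2d, and 3d), and purely hypothetical parameter values; the function names are illustrative and are not taken from the paper.

```python
import math


def logistic_cdf(x: float) -> float:
    """Logistic cumulative distribution function F(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sdt_category_probs(eta: int, d: float, criteria: list) -> list:
    """Probabilities of responses 1..K for a behavior in latent class eta.

    Assumes Pr(Y <= k | eta) = F(c_k - d * eta) for k = 1..K-1, with the
    criteria given in increasing order.
    """
    cumulative = [logistic_cdf(c - d * eta) for c in criteria] + [1.0]
    lower = [0.0] + cumulative[:-1]
    return [hi - lo for lo, hi in zip(lower, cumulative)]


# Hypothetical rater with d = 2 and criteria at the midpoints between the
# distributions (assumed values, for illustration only).
d = 2.0
neutral_criteria = [1.0, 3.0, 5.0]
strict_criteria = [1.0, 3.0, 6.0]  # c3 shifted up: a stricter rater

# Response probabilities for a behavior in the highest latent class (eta = 3).
p_neutral = sdt_category_probs(3, d, neutral_criteria)
p_strict = sdt_category_probs(3, d, strict_criteria)

# Same latent class, but a rater with weaker discrimination (d = 1).
p_low_d = sdt_category_probs(3, 1.0, [0.5, 1.5, 2.5])

print("neutral:", [round(p, 3) for p in p_neutral])
print("strict: ", [round(p, 3) for p in p_strict])
print("low d:  ", [round(p, 3) for p in p_low_d])
```

With these assumed values, raising c3 lowers the probability that a behavior in the highest latent class receives a “4,” and a smaller d spreads the response probabilities over more categories.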
More formally, for N items, J raters, and K discrete scores (such that 1≤ k≤K), the latent
class SDT model is expressed in Equation (1):
$$\Pr(Y_j \le k \mid \eta_c) = F(c_{jk} - d_j\,\eta_c) \qquad (1)$$
Here, $Y_j$ is rater j’s observed response, and F is the logistic cumulative distribution function. The
$\eta_c$ represents the categorical latent classes, which are the discrete ordered scores of examinee
ability defined by the scoring rubric. One of the aims of latent class analysis is to make model-
based classifications into a latent class using the observed response patterns (Dayton, 1998;
Clogg, 1995). The posterior probability of the latent variable c can be used to measure the
quality of this classification. Two measures for classification accuracy are presented for these
purposes. These measures are used in this paper to reflect the accuracy of classification derived
from the latent class SDT model. First, the expected proportion of cases correctly classified ($P_c$)
is calculated as follows:
$$P_c = \sum_s n_s \max\left[\Pr(\eta_c \mid Y_1, Y_2, \ldots, Y_J)\right] / N \qquad (2)$$
$$\lambda = \frac{P_c - \max \Pr(\eta_c)}{1 - \max \Pr(\eta_c)} \qquad (3)$$
Here, the s in Equation (2) indicates the unique response patterns, and $n_s$ corresponds to the
frequency of each pattern. Furthermore, $\max \Pr(\eta_c \mid Y_1, Y_2, \ldots, Y_J)$ is the maximum posterior
probability across the latent classes for a given response pattern, and N is the total number of
cases. In addition to the proportion correctly classified statistic ($P_c$), the lambda statistic (λ) is
considered, which accounts for classification that can occur by chance. This statistic can be
important when there is a latent class with a large size. The calculation of the lambda statistic is
presented in Equation (3). Both proportion correctly classified (Pc) and the lambda statistic (λ)
are used in this study to assess classification accuracy.
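Both classification statistics can be computed directly from the posterior probabilities of each unique response pattern. The sketch below is a minimal illustration. It assumes the lambda statistic takes the usual chance-corrected form, (Pc − largest class size) / (1 − largest class size), and that latent class sizes are obtained by averaging the posteriors. The three response patterns and their posteriors are invented for the example.

```python
def proportion_correctly_classified(patterns):
    """Pc: frequency-weighted mean of the maximum posterior probability.

    patterns is a list of (n_s, posterior) pairs, where n_s is the frequency
    of a unique response pattern and posterior holds the posterior
    probabilities of the latent classes for that pattern.
    """
    n_cases = sum(freq for freq, _ in patterns)
    return sum(freq * max(post) for freq, post in patterns) / n_cases


def latent_class_sizes(patterns):
    """Estimated class sizes: frequency-weighted mean of the posteriors."""
    n_cases = sum(freq for freq, _ in patterns)
    n_classes = len(patterns[0][1])
    return [
        sum(freq * post[c] for freq, post in patterns) / n_cases
        for c in range(n_classes)
    ]


def lambda_statistic(p_c, class_sizes):
    """Improvement over always assigning the largest (modal) latent class."""
    largest = max(class_sizes)
    return (p_c - largest) / (1.0 - largest)


# Hypothetical data: three unique response patterns, three latent classes.
patterns = [
    (40, [0.90, 0.10, 0.00]),
    (35, [0.20, 0.70, 0.10]),
    (25, [0.05, 0.15, 0.80]),
]

p_c = proportion_correctly_classified(patterns)
sizes = latent_class_sizes(patterns)
lam = lambda_statistic(p_c, sizes)
print(round(p_c, 3), [round(s, 4) for s in sizes], round(lam, 3))
```

The lambda statistic is smaller than Pc here because part of the classification accuracy would be obtained by simply assigning every case to the largest class.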
3. Hierarchical Rater Models, with Extensions for Modes of Observation
3.1 Modes of Observation
Scoring of performance assessment tasks can be based on different modes of observation,
which can include live (onsite) observations and post-exam videotaped observations. For
example, medical students are observed and scored based on their encounter with patients (van
der Vleuten and Swanson, 1990); the interactions with patients are often also videotaped for
post-exam evaluation. Surgeons performing technical operative skills are also assessed using live
observations and videotaped recordings (Vassiliou et al., 2007; Beard et al., 2005). In
professional and personnel testing, the use of different modes of observation has also become
more prevalent. Promotions of firefighters and police officers are based on a combination of
direct observations and post-exam review, where judges score candidates through live interaction
onsite and subsequently through video-based recordings (City of Columbus Civil Service, 2012).
More recently, in the K-12 assessment arena, evaluations of teaching effectiveness through
observations of teachers have raised interest in the use of videotaped classrooms to standardize
scoring (Bill and Melinda Gates Foundation, 2012). As these examples demonstrate,
assessments based on multiple modes of observations have gained increased use and interest in
the testing field.
Although live observations of examinee performance have been the traditional format of
testing performance assessment tasks, technological advances have provided the opportunity to
use videotaped observations that provide educators and testing agencies the promise of quality
control and also offer practical solutions to limitations in real-time assessments (Vivekananda-Schmidt
et al., 2007). Despite the rapid increase in multiple modes of observations to assess
examinees, there is limited research on the extent to which video-based observations can produce
psychometrically comparable scores when compared to direct and live observations. Furthermore,
it is unclear whether video-based observations may limit viewing and possibly distort
interpretation of affective qualities of the persons or situations. There is also research
suggesting that relationship building between the examinee and the rater may influence how
observers perceive encounters differently in live-interaction and videotaped observation
(Ryan et al., 1995). The effect of scoring based on different modes of observations has been an
understudied area in the educational measurement field. Yet, many non-cognitive attributes rely
on post-encounter video reviews for measurement.
3.2 Hierarchical Rater Models
Various measurement models have been proposed to examine performance assessments
that require the use of raters. A popular measurement model for performance assessments is
the item response theory (IRT) model. However, using IRT to estimate examinee ability ignores
rater effects; as Tate (1999) noted, IRT models can confound rater effects with item effects. In
other words, IRT models cannot distinguish scores between severe or lenient raters, which can
affect item characteristics that are used to estimate examinee performance. When the scoring
process involves the use of both live observations and videotaped recordings, an additional layer
of consideration – mode of observation – may influence raters’ judgment process; that is,
confounding can also occur through the mode of observation, which can further complicate
measuring examinee ability.
These issues raise the need to “disentangle” effects associated with raters, mode of
observation, and items. A natural candidate for resolving the confounding effect between raters
and items has been the hierarchical rater model (HRM; Patz, Junker, Johnson, and Mariano, 2002;
Mariano, 2002). An HRM-based signal detection rater model (HRM-SDT; DeCarlo, Kim, and
Johnson, 2011) has also been proposed to refine the original HRM. The idea behind these
hierarchical rater models is that scores assigned by raters become a direct indicator of
performance quality, which in turn, becomes an indicator of examinee ability.
The hierarchical structure of these models prevents confounding of item and rater effects.
In the HRM-SDT, a latent class signal detection theory (LC-SDT; DeCarlo, 2002) model is
specified as the rater model in level 1; the LC-SDT model in level 1 allows estimation of rater
characteristics such as severity and precision of raters. In level 2, the generalized partial credit
(GPC; Muraki, 1992) model is specified for the constructed response (CR) model to estimate
item characteristics such as item discrimination and item step parameters. In this study, an
additional level is specified to take into account the mode of observation, as an intermediary
level between the rater and item models.
3.3 Hierarchical rater model with LC-SDT
To examine differences in modes of observation, a hierarchical rater model (HRM) is
used. The HRM uses rater scores as indicators of performance quality, which thereby becomes
an indicator of examinee ability (DeCarlo, Kim, and Johnson, 2011). In the HRM-SDT, a signal
detection rater model is specified in level 1, which provides measures of rater precision and rater
effects. The LC-SDT model provides a measure of a rater’s precision in terms of how well they
discriminate between the latent classes (d). It also estimates their use of response criteria (ck),
which reflects rater effects such as how lenient or strict they score, as well as shrinkage and other
effects. In level 2, a polytomous IRT model is applied to estimate item parameters.
Equation (4) shows the level 1 LC-SDT model for J raters and K discrete scores, such that
1≤k≤K:
$$p(Y_{jl} \le k \mid \eta_l) = F(c_{jkl} - d_{jl}\,\eta_l) \qquad (4)$$
Here, $Y_{jl}$ is rater j’s observed response for item l, and F is the logistic cumulative distribution
function. The $\eta_l$ represents the categorical latent classes, which are the discrete ordered scores of
examinee ability defined by the scoring rubric. In level 2, the generalized partial credit (GPC;
Muraki, 1992) model is specified as shown in equation (5):
$$p(\eta_l = g \mid \theta) = \frac{\exp\left[\sum_{m=0}^{g} a_l(\theta - b_{lm})\right]}{\sum_{r=0}^{M-1} \exp\left[\sum_{m=0}^{r} a_l(\theta - b_{lm})\right]} \qquad (5)$$
Together, equations (4) and (5) can be combined to form equation (6), which becomes the HRM-SDT.
$$p(Y) = \sum_{\eta} \int p(Y \mid \eta, \theta)\, p(\eta \mid \theta)\, p(\theta)\, d\theta \qquad (6)$$
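To make the two-level structure concrete, the following sketch evaluates the marginal probability of a set of ratings for a single item. It is an illustration under stated assumptions rather than the estimation routine used in this paper. Level 1 uses the cumulative logistic form Pr(Y ≤ k | η) = F(c_k − dη) with η coded 0 to 3, level 2 uses the GPC model, ability is assumed to follow a standard normal distribution, and the integral is approximated on a fixed grid. All parameter values and function names are hypothetical.

```python
import math
from itertools import product


def logistic_cdf(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rater_probs(eta, d, criteria):
    """Level 1 (LC-SDT): probabilities of ratings 1..K given latent class eta."""
    cumulative = [logistic_cdf(c - d * eta) for c in criteria] + [1.0]
    lower = [0.0] + cumulative[:-1]
    return [hi - lo for lo, hi in zip(lower, cumulative)]


def gpc_probs(theta, a, steps):
    """Level 2 (GPC): probabilities of eta = 0..M-1 given ability theta.

    The m = 0 term of the inner sum is common to every category and cancels,
    so only the M-1 step parameters are passed in.
    """
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    peak = max(exponents)  # subtract the maximum to avoid overflow
    numerators = [math.exp(z - peak) for z in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]


def marginal_rating_prob(ratings, raters, a, steps, n_points=121):
    """p(Y) for one item: sum over eta, grid integration over theta ~ N(0, 1)."""
    grid = [-6.0 + 12.0 * i / (n_points - 1) for i in range(n_points)]
    weights = [math.exp(-0.5 * t * t) for t in grid]
    norm = sum(weights)
    # Marginal distribution of eta after integrating theta out.
    p_eta = [0.0] * (len(steps) + 1)
    for t, w in zip(grid, weights):
        for g, p in enumerate(gpc_probs(t, a, steps)):
            p_eta[g] += (w / norm) * p
    total = 0.0
    for eta, p in enumerate(p_eta):
        likelihood = p
        for y, (d, criteria) in zip(ratings, raters):
            likelihood *= rater_probs(eta, d, criteria)[y - 1]
        total += likelihood
    return total


# Two hypothetical raters (d, criteria) scoring one item on a 4-point rubric.
raters = [(2.0, [1.0, 3.0, 5.0]), (1.2, [0.2, 1.8, 3.4])]
item_a, item_steps = 1.0, [-1.0, 0.0, 1.0]

p_33 = marginal_rating_prob([3, 3], raters, item_a, item_steps)
all_patterns = sum(
    marginal_rating_prob([y1, y2], raters, item_a, item_steps)
    for y1, y2 in product(range(1, 5), repeat=2)
)
print(round(p_33, 4), round(all_patterns, 6))
```

Because the ratings depend on ability only through the latent class, the integral over ability can be carried out once to obtain the marginal distribution of η, and the rater likelihoods are then summed over the latent classes.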
3.4 Hierarchical Rater Model for Modes of Observation (HRM-MO)
This study proposes an extension to the HRM-SDT, which includes a model for
examining the quality of observation mode by applying the LC-SDT model. In level 1, rater
scores (Y) are indicators for latent categorical performance quality (η), which in turn, becomes
an indicator for the latent categorical quality of observation mode (Φ) in level 2; the quality of
observation mode becomes an indicator for examinee ability (θ) in level 3.
In the HRM-MO, the LC-SDT model is used for levels 1 (rater model) and 2 (mode of
observation model). The same equation (1) is used for level 1. In level 2, the following equation
(7) is used to model the mode of observation, where the parameter f indicates the criteria
associated with the mode of observation and the parameter h indicates how well the quality of
performances is discriminated within each latent class of observation modes (o).
$$p(\eta_{ol} \le k \mid \Phi_l) = F(f_{okl} - h_{ol}\,\Phi_l) \qquad (7)$$
In level 3, the following equation, based on the GPC model, is used to estimate item parameters,
where the parameter a indicates item discrimination, and the parameter b indicates the category
steps.
$$p(\Phi_l = g \mid \theta) = \frac{\exp\left[\sum_{m=0}^{g} a_l(\theta - b_{lm})\right]}{\sum_{r=0}^{M-1} \exp\left[\sum_{m=0}^{r} a_l(\theta - b_{lm})\right]} \qquad (8)$$
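Together, equation (9) presents the HRM-MO, which will be examined in this study.
$$p(Y) = \sum_{\eta,\Phi} \int p(Y \mid \eta, \Phi, \theta)\, p(\eta \mid \Phi, \theta)\, p(\Phi \mid \theta)\, p(\theta)\, d\theta \qquad (9)$$
Assumptions of conditional independence are made at each level, to simplify equations at each
level:
$$p(Y \mid \eta, \Phi, \theta) = p(Y \mid \eta) = \prod_{j}\prod_{l} p(Y_{jl} \mid \eta_l) \qquad (10)$$
$$p(\eta \mid \Phi, \theta) = p(\eta \mid \Phi) = \prod_{o}\prod_{l} p(\eta_{ol} \mid \Phi_l) \qquad (11)$$
$$p(\Phi \mid \theta) = \prod_{l} p(\Phi_l \mid \theta) \qquad (12)$$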
)|()|( pp (12)
The complete HRM-MO model can be obtained by substituting equations (10), (11), and (12)
into equation (9) and using probabilities from equations (4), (7), and (8).
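The three-level chain described in this section (ratings indicate performance quality within a mode, which indicates the quality of observation, which indicates ability) can be sketched for a single item as follows. This is a simplified illustration and not the author's estimation code. It assumes the cumulative logistic form at levels 1 and 2, one latent quality per observation mode, raters nested within modes, and a fixed vector of probabilities for Φ standing in for the level 3 integral over ability. The mode labels, parameter values, and function names are hypothetical.

```python
import math
from itertools import product


def logistic_cdf(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probs(latent, slope, cuts):
    """Category probabilities from the cumulative form F(cut_k - slope * latent)."""
    cumulative = [logistic_cdf(c - slope * latent) for c in cuts] + [1.0]
    lower = [0.0] + cumulative[:-1]
    return [hi - lo for lo, hi in zip(lower, cumulative)]


def hrm_mo_pattern_prob(ratings_by_mode, raters_by_mode, mode_params, phi_probs):
    """Probability of one item's ratings under the chain Y <- eta <- Phi.

    ratings_by_mode: {mode: [rating, ...]}, ratings coded 1..K
    raters_by_mode:  {mode: [(d, criteria), ...]}, level 1 rater parameters
    mode_params:     {mode: (h, f)}, level 2 mode-of-observation parameters
    phi_probs:       probabilities of Phi = 0..K-1 (stand-in for level 3)
    """
    total = 0.0
    for phi, p_phi in enumerate(phi_probs):
        contribution = p_phi
        for mode, ratings in ratings_by_mode.items():
            h, f = mode_params[mode]
            p_eta = ordinal_probs(phi, h, f)  # level 2: eta given Phi
            mode_sum = 0.0
            for eta, p in enumerate(p_eta):
                likelihood = p
                for y, (d, c) in zip(ratings, raters_by_mode[mode]):
                    likelihood *= ordinal_probs(eta, d, c)[y - 1]  # level 1
                mode_sum += likelihood
            contribution *= mode_sum
        total += contribution
    return total


# Hypothetical 3-category example with one live rater and one video rater.
raters_by_mode = {"live": [(2.0, [1.0, 3.0])], "video": [(1.5, [0.75, 2.25])]}
mode_params = {"live": (3.0, [1.5, 4.5]), "video": (2.0, [1.0, 3.0])}
phi_probs = [0.3, 0.5, 0.2]

p_live3_video2 = hrm_mo_pattern_prob(
    {"live": [3], "video": [2]}, raters_by_mode, mode_params, phi_probs
)
total_prob = sum(
    hrm_mo_pattern_prob(
        {"live": [y_live], "video": [y_video]},
        raters_by_mode, mode_params, phi_probs,
    )
    for y_live, y_video in product([1, 2, 3], repeat=2)
)
print(round(p_live3_video2, 4), round(total_prob, 6))
```

In this sketch a smaller h for a mode means that the quality perceived through that mode tracks the underlying quality less closely, which is the sense in which the level 2 parameters describe the quality of an observation mode.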
4. Chicago Public Schools (CPS) Teacher Observation Data
4.1 Methods
Data. Empirical data for the Chicago Public Schools (CPS) teacher observations were
collected during the 2012-2013 academic year, as part of the teacher evaluation system recently
implemented in its legislative policies (Title 23: Education and Cultural Resources, Part 50
Evaluation of Certified Employees under Article 24A and 34 of the School Code). A total of
1,000 teacher evaluations were collected (the original sample was 1,060, of which 1,000 were
used for this study due to missing data and other sampling issues), where a principal observed a
teacher, jointly with an Instructional Effectiveness Specialist (IES). Evaluations measured two
non-cognitive attributes: Classroom Environment and Instruction. Classroom Environment
corresponds to Domain 2 of the CPS Framework for Teaching, which measures “creating an
environment of respect and rapport,” “establishing a culture for learning,”
“managing classroom procedures,” and “managing student behavior.” Instruction corresponds to
Domain 3, which measures “communicating with students,” “using questions and discussion
techniques,” “engaging students in learning,” “using assessment in instruction,” and
“demonstrating flexibility and responsiveness.” Details of the observation instrument used to
measure teachers’ non-cognitive attributes can be found in the Chicago Public Schools’ Teacher
Evaluation Plan and Handbook of Procedures (Chicago Public Schools, 2012). Teachers were
observed by principal-IES pairs. A total of 96 principals and 19 IES raters contributed ratings to the
study data.
Analysis. The CPS REACH data were fit using the LC-SDT model, as specified in
Equation (1). To avoid common boundary estimation problems among latent class models in
numerical computation, a partly Bayesian approach using posterior mode estimation was used
(McLachlan and Krishnan, 2008). In this approach, a prior specification of Bayes constants (α1)
is used to smooth boundaries with estimation issues (DeCarlo, Kim, and Johnson, 2012). Bayes
constants can be interpreted as adding α1 observations to the data. If α1 is set equal to zero, log
p(θ) = 0, which will obtain maximum likelihood estimates: 00 |
0
1| log][log
uu zxzx UKp
. Here,
K denotes the number of latent classes (behaviors). The influence of the prior is equivalent to
adding α1 / K cases to each latent class (Park and Lee, 2014).
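The smoothing effect of the Bayes constant on a boundary estimate can be illustrated with latent class proportions. The sketch below assumes only the interpretation given above, namely that the prior acts like adding α1 / K cases to each of the K latent classes. The expected class counts and the function name are hypothetical.

```python
def smoothed_class_sizes(class_counts, alpha1):
    """Class proportions after adding alpha1 / K pseudo-cases to each class.

    With alpha1 = 0 this reduces to the maximum likelihood proportions.
    """
    k = len(class_counts)
    total = sum(class_counts) + alpha1
    return [(n + alpha1 / k) / total for n in class_counts]


# Hypothetical expected counts for four latent classes; the first is empty,
# which would place the maximum likelihood estimate on the boundary.
counts = [0.0, 120.0, 700.0, 180.0]

ml_sizes = smoothed_class_sizes(counts, alpha1=0.0)
pm_sizes = smoothed_class_sizes(counts, alpha1=1.0)
print([round(p, 5) for p in ml_sizes])
print([round(p, 5) for p in pm_sizes])
```

With α1 = 1 the empty class receives a small positive proportion while the other estimates barely change, which keeps the solution away from the boundary.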
To examine rater (observer) characteristics, parameter estimates (ckj and dj) from the
model were plotted. Post-hoc comparisons between raters, particularly between principals and
IES raters, were made using discrimination parameters that represent rater precision. Estimated
model parameters were used to derive model-based scores. Model-based scores that account for
rater characteristics were compared with original ratings provided by principals and IES raters
using measures of agreement (% exact agreement, kappa, quadratically-weighted kappa). Finally,
value-added scores (standardized to a –3 to 3 scale) for mathematics, reading, and combined
subjects were used as covariates in a latent class regression, $\text{logit}\, p(\eta \mid Z) = b_k + h_k Z$,
where Z represents the vector of covariates. Estimates of $h_k$ were compared
with traditional linear regression coefficients using original principal and IES ratings.
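The three agreement measures can be computed from a pair of rating vectors as sketched below. The implementation uses the standard definitions of Cohen's kappa with disagreement weights (0/1 for unweighted, absolute distance for linear, squared distance for quadratic). The two short rating vectors are invented for illustration and are not drawn from the CPS data.

```python
def agreement_measures(ratings_a, ratings_b, categories):
    """Exact agreement, kappa, and linearly/quadratically weighted kappa."""
    n = len(ratings_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def kappa(weight):
        # Kappa = 1 - (observed weighted disagreement / expected by chance).
        obs = sum(weight(i, j) * observed[i][j]
                  for i in range(k) for j in range(k))
        exp = sum(weight(i, j) * row[i] * col[j]
                  for i in range(k) for j in range(k))
        return 1.0 - obs / exp

    return {
        "exact": sum(observed[i][i] for i in range(k)),
        "kappa": kappa(lambda i, j: 0.0 if i == j else 1.0),
        "linear": kappa(lambda i, j: abs(i - j) / (k - 1)),
        "quadratic": kappa(lambda i, j: ((i - j) / (k - 1)) ** 2),
    }


# Hypothetical ratings of ten teachers on a 4-point rubric.
principal = [3, 3, 2, 4, 3, 2, 3, 1, 3, 2]
ies = [3, 2, 2, 4, 3, 3, 3, 1, 4, 2]

result = agreement_measures(principal, ies, categories=[1, 2, 3, 4])
print({name: round(value, 3) for name, value in result.items()})
```

In this example every disagreement is off by a single category, so the weighted versions penalize the disagreements less than unweighted kappa does.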
4.2 Results
Descriptive statistics. Over 85% of the rating distribution was concentrated in the
middle categories of “2” and “3.” Table 1 shows the distribution of ratings by component and
domain. Since each teacher observed was double scored by a principal-IES pair, their agreement
was examined. Non-cognitive attributes in Domain 2 (Classroom Environment) had about 75%
exact agreement, while attributes in Domain 3 (Instruction) had about 66% exact agreement.
Table 2 shows measures of agreement using kappa and weighted kappa, which account for
chance agreement and, in the weighted case, penalize larger discrepancies between the pairs. The table is also
stratified by tenured and non-tenured teachers, as often examined by content experts.
Table 1. Descriptive statistics
                                      Rating Performance Category (%)
Domain / Component                    “1”      “2”      “3”      “4”
Classroom Environment (Domain 2)
  Respect and Rapport                 1.70    19.49    62.02    16.79
  Culture of Learning                 2.65    30.73    53.87    12.74
  Managing Procedures                 3.36    26.27    60.30    10.08
  Managing Behavior                   3.35    26.91    59.23    10.51
Instruction (Domain 3)
  Communication                       3.75    31.57    54.88     9.80
  Questioning and Discussion          8.37    51.40    35.02     5.21
  Engaging in Learning                6.95    44.28    42.18     6.60
  Assessment in Instruction           9.83    48.77    37.18     4.21
  Flexibility and Responsiveness      8.46    46.99    39.01     5.53
Note: Values represent row percentages. A total of 1,000 observations were scored by principals and IES raters using a 4-point rating scale (CPS Framework for Teaching; see http://cps.edu/sitecollectiondocuments/cpsframeworkteaching.pdf for a full description of the rubric; document accessed on October 1, 2014).
Model parameter estimates. Figure 2 shows the LC-SDT model parameters, restricted
to IES raters (complete table of results for all 115 raters can be obtained from the author). In the
left figure, rater precision estimates and their respective 95% confidence intervals are plotted.
The X-axis represents the 19 IES raters and the Y-axis represents the rater precision estimates (dj
from Equation [1]). Results show wide variability in rater precision even among IES raters. IES
raters are highly trained observers who visit schools to work with principals to improve their
teacher evaluation skills. The wide variability in rater precision estimates indicates the need to
adjust for rater-specific differences, which are often ignored in practice. In the figure to the right,
plots of the rater criteria (ckj) are presented. Since the CPS Framework for Teaching is based on 4
ordinal performance categories, there are three criteria locations. A criteria estimate that is higher
relative to other raters indicates greater severity; lower estimates indicate leniency.
Table 2. Measures of agreement between principals and IES raters
                                      % Exact      Kappa          Kappa        Kappa
Data / Domain / Component             Agreement    (Unweighted)   (Linear)     (Quadratic)
All teachers (n=1,000)
 Classroom Environment
  Respect and Rapport                 75.42%       .55 (.02)      .60 (.02)    .67 (.03)
  Culture of Learning                 75.69%       .59 (.02)      .63 (.02)    .68 (.03)
  Managing Procedures                 76.35%       .58 (.02)      .62 (.02)    .69 (.03)
  Managing Behavior                   73.11%       .53 (.02)      .59 (.02)    .66 (.03)
 Instruction
  Communication                       71.54%       .52 (.02)      .57 (.02)    .64 (.03)
  Questioning and Discussion          68.47%       .48 (.02)      .53 (.02)    .58 (.03)
  Engaging in Learning                66.11%       .46 (.02)      .51 (.02)    .58 (.03)
  Assessment in Instruction           66.80%       .47 (.02)      .52 (.02)    .59 (.03)
  Flexibility and Responsiveness      60.16%       .37 (.02)      .44 (.02)    .51 (.03)
Non-tenured teacher observations (n=503)
 Classroom Environment
  Respect and Rapport                 76.54%       .55 (.03)      .59 (.03)    .66 (.04)
  Culture of Learning                 78.64%       .62 (.03)      .64 (.03)    .67 (.04)
  Managing Procedures                 77.76%       .60 (.03)      .65 (.03)    .71 (.04)
  Managing Behavior                   75.85%       .57 (.03)      .61 (.03)    .67 (.04)
 Instruction
  Communication                       74.05%       .54 (.03)      .58 (.03)    .64 (.04)
  Questioning and Discussion          66.53%       .43 (.03)      .47 (.03)    .53 (.04)
  Engaging in Learning                68.66%       .48 (.03)      .51 (.03)    .54 (.04)
  Assessment in Instruction           68.07%       .46 (.03)      .50 (.03)    .55 (.04)
  Flexibility and Responsiveness      61.23%       .38 (.03)      .43 (.03)    .50 (.04)
Tenured teacher observations (n=497)
 Classroom Environment
  Respect and Rapport                 74.09%       .54 (.03)      .60 (.03)    .67 (.04)
  Culture of Learning                 72.93%       .56 (.03)      .62 (.03)    .69 (.04)
  Managing Procedures                 75.20%       .55 (.03)      .60 (.03)    .66 (.04)
  Managing Behavior                   70.24%       .48 (.03)      .55 (.03)    .64 (.04)
 Instruction
  Communication                       69.23%       .50 (.03)      .55 (.03)    .63 (.04)
  Questioning and Discussion          70.53%       .53 (.03)      .57 (.03)    .61 (.04)
  Engaging in Learning                64.11%       .44 (.03)      .51 (.03)    .60 (.04)
  Assessment in Instruction           65.31%       .46 (.03)      .53 (.03)    .62 (.04)
  Flexibility and Responsiveness      58.95%       .36 (.03)      .43 (.03)    .51 (.04)
Note: IES raters are “Instructional Effectiveness Specialists” who provide guidance on rater training to principals. Values in parentheses are standard errors.
19
Note:
1. Figure on left shows estimates of rater precision (dj estimates) across the 19 IES raters. A greater rater precision estimate reflects greater ability for the rater to discriminate differences between behaviors.
2. Figure on the right shows relative criteria estimates for the 19 IES raters. Since there are 4 categories in the rubric, there are 3 criteria locations (cut points) in the distribution. Criteria estimates were standardized to the same scale to make comparisons between raters. Higher criteria estimate indicates severity, while a lower estimate indicates leniency.
Figure 2. Parameter estimates from LC-SDT: Rater precision and relative criteria
[Figure 2 image not reproduced. Left panel: Rater Precision (x-axis: Rater, 1 to 19; y-axis: Rater Precision, 0 to 8). Right panel: Relative Criteria (x-axis: Rater, 1 to 19; y-axis: Relative Criteria, –.4 to 1.4).]
Comparing model-based scores with original rater scores. Based on model estimated
parameters, model-based scores were generated. Value-added scores (combined subjects,
mathematics, and reading) were regressed simultaneously to the estimated latent classes. In
comparison, traditional linear regression was used to examine the regression coefficient effects.
Table 3 presents the results comparing the two methods.
Results show that when a psychometric rater model is used, the coefficients of the
value-added scores have greater effect sizes. For example, for the combined value-added scores,
the regression coefficient is .15 under latent class regression and .09 under linear regression.
While this difference is modest, with similar standard error estimates, it indicates some value
in using psychometric rater models to refine the measurement precision of non-cognitive
attributes.
Table 3. Comparison of coefficient effects: Latent class regression and linear regression
Value-added score   Latent class regression coefficients   Linear regression coefficients
                    using model-based scores               using original ratings
Combined            .154 (0.048)**                         .093 (.039)*
Mathematics         .259 (0.050)***                        .166 (.036)
Reading             –.009 (0.043)                          –.005 (.035)
Note: Value-added scores were standardized to a scale from –3 to 3 (see Value-Added Research Center, 2014). *p<.05; **p<.01; ***p<.001. Values in parentheses represent standard errors.
5. Columbus Police and Firefighter Promotion Data
5.1 Methods
Data. In this section, data were analyzed from a real-world administration of live and
video-recorded observation scores, in which candidates complete two exercises (items) and 6
different raters score each exercise, for a total of 12 raters. For each exercise, 3 raters
score the candidate through live observation, with possible interactions between the examinee
and the raters; the remaining 3 raters score a video recording of the performance at a subsequent
time. In other words, raters 1, 2, and 3 score exercise 1 through live observation; raters 4, 5, and
6 score exercise 1 through videotaped recording. Similarly, raters 7, 8, and 9 score exercise 2
through live observation; raters 10, 11, and 12 score exercise 2 through videotaped recording.
All raters were trained to score using a holistic 3-point rating scale, which measures the
following skills: oral communication, interpersonal relations, information analysis, and problem
sensing and resolution ability. The data contain 440 global ratings from each rater for each
exercise.
Analysis. Data were used to fit both HRM-SDT and HRM-MO. Model fit indices,
parameters, latent class size, and classification indices were compared. Estimation was
conducted using Latent Gold 4.5 (Vermunt and Magidson, 2005).
5.2 Results
Descriptive statistics. Table 4 shows the descriptive statistics of the ratings as well as the
rater agreement statistics for each mode of observation.
Table 4. Distribution of scores assigned and rater agreement
Exercise    Mode                 Rater   Score Assigned (%)        Rater Agreement
                                         “1”     “2”     “3”       Kappa   Weighted Kappa
1 (n=440)   Live (onsite)        1       16.82   35.00   48.18
                                 2       14.09   44.09   41.82     .51     .66
                                 3       11.82   36.82   51.36
            Video                4       14.77   41.14   44.09
                                 5       15.00   42.95   42.05     .38     .56
                                 6       15.45   49.55   35.00
            Combined (overall)           14.66   41.59   43.75     .36     .52
2 (n=440)   Live (onsite)        7        7.50   41.59   50.91
                                 8        3.86   36.14   60.00     .50     .62
                                 9        6.36   37.27   56.36
            Video                10      16.36   43.18   40.45
                                 11       9.09   47.95   42.95     .37     .57
                                 12      16.59   45.00   38.41
            Combined (overall)            9.96   41.86   48.18     .31     .49
Note: Each rater assigned scores for 440 observations on a 3-point scale. “Weighted kappa” used quadratic weights. Scores assigned (%) indicates row percentages.
Results indicate that, combined across raters, fewer than 15% of the scores were “1” for
exercise 1 (14.66%) and fewer than 10% were “1” for exercise 2 (9.96%). Similar proportions of
scores were assigned for “2” and “3” across both exercises.
Rater agreement was greater for live observations than for video-based observations.
Kappa and quadratically weighted kappa were used to measure rater agreement. Kappa takes into
account agreement that can occur by chance (Cohen, 1960; Cohen, 1968). The weighted kappa
penalizes larger discrepancies between raters more than smaller discrepancies (Shaeffer, Briel,
and Fowles, 2001). For exercise 1, the kappa and weighted kappa were .51 and .66 for live
observations and .38 and .56 for video-based observations, respectively. A similar trend was
found for exercise 2, where the kappa and weighted kappa were .50 and .62 for live observations
and .37 and .57 for video-based observations, respectively. When ratings were combined across
modes of observation, rater agreement decreased further (kappa of .36 and .31 and weighted
kappa of .52 and .49 for exercises 1 and 2, respectively).
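The two agreement statistics can be illustrated with a short Python sketch for a single pair of raters. This is a minimal illustration rather than the computation behind Table 4: how agreement was aggregated over the three raters in each mode is not described here, and the function name `cohen_kappa` and the toy ratings are hypothetical.

```python
import numpy as np

def cohen_kappa(r1, r2, n_cat, weights=None):
    """Cohen's kappa for two raters with categories coded 1..n_cat.

    weights=None gives unweighted kappa; weights="quadratic" gives
    quadratically weighted kappa.
    """
    # Observed joint proportions of (rater 1 score, rater 2 score)
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(r1, r2):
        obs[a - 1, b - 1] += 1
    obs /= obs.sum()
    # Joint proportions expected by chance (independent raters)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Disagreement weights: 0 on the diagonal, larger for bigger gaps
    i, j = np.indices((n_cat, n_cat))
    if weights == "quadratic":
        w = (i - j) ** 2 / (n_cat - 1) ** 2
    else:
        w = (i != j).astype(float)
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Hypothetical ratings on a 3-point scale (not data from the study)
rater_a = [1, 1, 2, 2, 3, 3]
rater_b = [1, 2, 2, 3, 3, 3]
print(cohen_kappa(rater_a, rater_b, 3))
print(cohen_kappa(rater_a, rater_b, 3, weights="quadratic"))
```

Because the quadratic weights give adjacent-category disagreements only a small penalty, the weighted statistic is typically larger than the unweighted one when most disagreements are one category apart, which matches the pattern in Table 4.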
Model fit, latent class sizes, and classification. Table 5 shows the model fit comparison
between the HRM-SDT and HRM-MO.
Table 5. Model comparison using information criteria
Fit Index          HRM-SDT     HRM-MO
# of parameters    42          54
–2LL               8,101.12    7,866.50
AIC                8,185.12    7,974.50
BIC                8,356.77    8,195.18
Note: “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.
Results of the model comparison for the real-world data indicate a better model fit for the
HRM-MO model (lower AIC and BIC). Table 6 shows the classification indices, Pc and λ.
Classification indices indicate the quality of classification based on posterior probabilities of the
model. The Pc measures classification accuracy, and the λ statistic accounts for classification that
can occur by chance (Clogg and Manning, 1996). For both exercises, classification was lower for
the video-based observation (η12 and η22) when compared to live observation (η11 and η21). In
addition, classification was lower for the combined latent categorical variables (Φ1 and Φ2).
Latent class sizes for the latent categorical variables are also presented in Table 6.
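As a check on the information criteria in Table 5, the following Python sketch recomputes AIC and BIC from the reported –2LL and number of parameters. It assumes the standard definitions (AIC = –2LL + 2p; BIC = –2LL + p ln n) and takes n = 440, the number of rated performances per exercise, as the sample size; the function name is hypothetical.

```python
import math

def information_criteria(neg2ll, n_params, n_obs):
    """Return (AIC, BIC) from -2 log-likelihood, parameter count, and sample size."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_obs)
    return aic, bic

# -2LL and number of parameters as reported in Table 5; n = 440 is an assumption
models = {"HRM-SDT": (8101.12, 42), "HRM-MO": (7866.50, 54)}
for name, (neg2ll, k) in models.items():
    aic, bic = information_criteria(neg2ll, k, 440)
    print(f"{name}: AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Under these assumptions the recomputed values agree with Table 5 to within rounding of the reported –2LL. The BIC penalty for the 12 additional HRM-MO parameters is smaller than the reduction in –2LL, which is why both criteria favor the HRM-MO.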
Table 6. Classification indices and latent class sizes by model
Model     Latent     Classification   Latent class sizes
          variable   Pc     λ         Class 1     Class 2     Class 3
HRM-SDT   η1         .93    .87       .16 (.02)   .47 (.03)   .37 (.03)
          η2         .92    .82       .16 (.02)   .32 (.03)   .52 (.04)
HRM-MO    η11        .92    .87       .20 (.03)   .38 (.03)   .42 (.03)
          η12        .89    .79       .15 (.02)   .49 (.05)   .35 (.05)
          η21        .90    .82       .17 (.03)   .33 (.03)   .50 (.04)
          η22        .88    .75       .14 (.02)   .40 (.04)   .46 (.04)
          Φ1         .91    .82       .14 (.03)   .54 (.05)   .32 (.05)
          Φ2         .86    .74       .19 (.03)   .45 (.05)   .36 (.05)
Note: Proportion correctly classified (Pc) and λ are both measures of classification based on posterior probability (Clogg, 1995). The λ statistic accounts for classification that can occur by chance. Values in parentheses represent standard errors.
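The classification indices can be sketched in Python as follows. This is a hedged illustration that assumes the usual posterior-based definitions: Pc is the mean of each examinee's largest posterior class probability, and λ rescales Pc against the accuracy obtained by assigning everyone to the largest class. The function name and the posterior matrix are hypothetical, and the software's exact computation may differ.

```python
import numpy as np

def classification_indices(posterior):
    """Return (Pc, lambda) from an (examinees x classes) posterior matrix."""
    post = np.asarray(posterior, dtype=float)
    # Expected proportion correctly classified under modal assignment
    pc = post.max(axis=1).mean()
    # Estimated latent class sizes are the average posterior probabilities
    largest = post.mean(axis=0).max()
    # Improvement over always assigning examinees to the largest class
    lam = (pc - largest) / (1.0 - largest)
    return pc, lam

# Hypothetical posterior probabilities for four examinees and three classes
posterior = [[.9, .1, .0],
             [.1, .8, .1],
             [.0, .2, .8],
             [.2, .7, .1]]
pc, lam = classification_indices(posterior)
print(round(pc, 3), round(lam, 3))
```

Applied to the rounded Table 6 entries, this definition approximately reproduces the reported λ values; for example, for η1 it gives (.93 – .47) / (1 – .47) ≈ .87.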
Level 1: Rater model parameters. Table 7 shows level 1 rater parameters by HRM-SDT
and HRM-MO models. Results indicate that rater discrimination (d parameter), which
indicates how well a rater is able to discriminate between different qualities of performance
(rater precision), was generally greater for live (onsite) scoring than video-based scoring for both
exercises, where the difference was slightly greater for exercise 1 than exercise 2. However, the
average rater discrimination between the two exercises was comparable. The distribution of rater
discrimination estimates identifies the raters who are better able to detect differences between the
categories. Although estimates were different, the overall trends between the HRM-SDT and
HRM-MO were similar.
Figure 3 (left: relative criteria; right: discrimination) illustrates the parameters for the
HRM-MO model.
Table 7. Rater parameters: Level 1 (Signal Detection Theory Rater Model) by model
Exercise   Mode            Rater   Parameter   HRM-SDT      HRM-MO
1          Live (Onsite)   1       c11         1.20 (.31)   1.30 (.37)
                                   c12         4.43 (.38)   6.05 (.62)
                                   d1          3.55 (.30)   4.65 (.48)
                           2       c21          .66 (.27)    .54 (.28)
                                   c22         4.62 (.38)   5.49 (.49)
                                   d2          3.26 (.26)   3.66 (.30)
                           3       c31          .41 (.28)    .05 (.25)
                                   c32         4.04 (.38)   3.75 (.38)
                                   d3          3.42 (.31)   3.04 (.29)
           Video           4       c41          .22 (.24)    .60 (.28)
                                   c42         3.14 (.30)   3.85 (.38)
                                   d4          2.25 (.21)   2.83 (.26)
                           5       c51          .37 (.24)    .78 (.28)
                                   c52         3.53 (.32)   4.34 (.39)
                                   d5          2.47 (.23)   3.12 (.31)
                           6       c61          .43 (.25)    .97 (.32)
                                   c62         4.02 (.35)   5.18 (.52)
                                   d6          2.47 (.22)   3.29 (.30)
2          Live (Onsite)   7       c71          .40 (.26)    .41 (.29)
                                   c72         4.38 (.53)   4.90 (.57)
                                   d7          3.03 (.29)   3.50 (.33)
                           8       c81         1.30 (.29)   1.36 (.30)
                                   c82         3.59 (.44)   3.66 (.48)
                                   d8          3.02 (.31)   3.20 (.39)
                           9       c91          .55 (.26)    .55 (.30)
                                   c92         4.64 (.55)   6.31 (.96)
                                   d9          3.55 (.35)   5.10 (.93)
           Video           10      c101         .70 (.26)   1.36 (.39)
                                   c102        4.09 (.41)   5.41 (.65)
                                   d10         2.41 (.23)   3.34 (.34)
                           11      c111         .15 (.25)    .26 (.31)
                                   c112        4.59 (.46)   6.05 (.65)
                                   d11         2.79 (.26)   3.86 (.38)
                           12      c121         .32 (.22)    .71 (.27)
                                   c122        3.24 (.31)   3.98 (.38)
                                   d12         1.82 (.18)   2.36 (.22)
Note: Values in parentheses represent standard errors. Parameter d represents rater discrimination and c represents rater criteria. “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.
Note:
1. In the left figure, the X-axis indicates the rater IDs; the Y-axis indicates relative criteria estimates. Raters 1 to 3 and 7 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively. Horizontal lines were added as reference points at the criteria locations where the likelihood ratios are maximized.
2. In the right figure, the X-axis indicates rater IDs; the Y-axis indicates rater discrimination estimates. Raters 1 to 3 and 7 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively.
Figure 3. Plots of relative criteria by rater characteristics
In Figure 3 (left), the relative criteria for the 12 raters are presented. Relative criteria are
standardized estimates of rater effects that allow comparison between raters (direct comparisons
of c parameters between raters in Table 7 are not accurate, because the criteria depend on the
raters' d parameters and must first be standardized). The X-axis indicates rater IDs and the Y-axis
presents the relative criteria locations that have been standardized by accounting for differences
in rater discrimination. Since there are three categories, there are two criteria locations per rater.
Horizontal lines were added to provide reference points in the criteria where the likelihood ratios
are maximized, meaning higher rater criteria location indicates leniency and lower location
indicates severity. In general, all raters were severe in their use of the lower scoring category, as
indicated by the relative criteria estimates below the horizontal line.
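To make the standardization concrete, the sketch below rescales the HRM-MO criteria of three raters from Table 7 by their discrimination estimates. Dividing each criterion by the rater's d is one common convention in the signal detection literature and is used here as an assumption; the exact standardization behind Figure 3 is not given in this section, and the function name is hypothetical.

```python
def relative_criteria(c, d):
    """Rescale a rater's criteria by that rater's discrimination (assumed c / d)."""
    return tuple(ck / d for ck in c)

# HRM-MO estimates from Table 7: rater -> ((c_1, c_2), d)
hrm_mo = {
    1: ((1.30, 6.05), 4.65),
    3: ((0.05, 3.75), 3.04),
    12: ((0.71, 3.98), 2.36),
}

for rater, (c, d) in hrm_mo.items():
    rel = relative_criteria(c, d)
    print(rater, [round(r, 2) for r in rel])
```

On this scale the criteria of raters with very different d values become comparable: rater 1 has the larger raw first criterion (1.30 versus .71 for rater 12), yet the two relative locations are close (about .28 and .30).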
Figure 3 (right) shows the rater discrimination estimates by rater. The X-axis represents
the rater IDs, and the Y-axis represents the rater discrimination estimates. As indicated in Table
7, rater discrimination was generally higher for live observations. Moreover, rater 12 had the
lowest rater discrimination, indicating lower ability to discriminate differences between the
qualities of performance demonstrated by the examinees.
Level 2: Mode of observation parameters. Table 8 shows the level 2 parameters,
pertaining to the quality of observation mode. Similar to level 1, the LC-SDT model was used to
estimate differences in the quality of latent categorical scores between modes of observation. The
f parameter indicates mode effect, similar to the c parameter that indicated rater effects. The h
parameter, similar to the d parameter, indicates how well the mode of observation was used to
discriminate differences between latent qualities of examinee performance.
Results indicate that for exercise 1, the h parameter was greater for video-based
recordings (5.75) than for live observations (3.77). For exercise 2, live observation had slightly
greater discrimination (5.86) than video-based recordings (4.24). These results may indicate that
video-based observations were better at discriminating different qualities of examinee performance
than live observations for exercise 1. Relative criteria based on the f parameter were similar
between the different modes of observation.
Table 8. Mode of observation parameters: Level 2 (Signal Detection Model)
Mode of observation       Parameter   Exercise 1    Exercise 2
Live observation          f11         2.19 ( .77)   1.97 (1.04)
                          f12         4.83 ( .89)   6.68 (1.79)
                          h1          3.77 ( .76)   5.86 (1.67)
Video-based observation   f21         2.68 (1.26)    .86 ( .46)
                          f22         8.20 (1.53)   5.30 (1.29)
                          h2          5.75 (1.17)   4.24 (1.24)
Note: Values in parentheses represent standard errors. Parameter h represents discrimination and f represents criteria.
Combining results from level 1 and level 2, the estimates seem to indicate that raters assigned to
score live observations were more precise (higher rater discrimination) than raters assigned to
score video-based recordings. However, between the two modes of observations, video-based
recordings allowed greater discrimination of differences in quality than live observations for
exercise 1.
Level 3: Item parameters. Table 9 presents the item parameters for the two exercises by
HRM-SDT and HRM-MO models.
Table 9. Item parameters: Level 3 (Generalized partial credit model)
Parameter   Exercise 1                   Exercise 2
            HRM-SDT       HRM-MO         HRM-SDT        HRM-MO
a            1.67 (.64)    2.53 (.84)     2.04 (1.03)    2.63 (.97)
b1          –2.16 (.63)   –3.29 (.92)    –2.34 (1.04)   –2.70 (.86)
b2            .68 (.28)    1.38 (.57)     –.32 ( .26)    1.04 (.58)
Note: Values in parentheses represent standard errors. Parameter a represents item discrimination and b represents category step parameter based on the generalized partial credit model (Muraki, 1992). “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.
Between the two HRMs, the HRM-MO had greater item discrimination (a) estimates. Moreover,
the step category (b) parameters were spaced further apart. However, the general trends in the
parameters were similar, with slightly greater estimates of item discrimination for exercise 2.
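To show how the a and b parameters shape category probabilities, the following Python sketch evaluates the generalized partial credit model (Muraki, 1992) for a three-category item. It assumes the standard parameterization without a scaling constant, in which the log-odds of adjacent categories are a(θ – b_k); the paper's exact parameterization is not shown in this section, and the function name is hypothetical. The example values a = 2.0 and b = (–1.5, 1.5) are the exercise 1 generating values used later in the simulation study.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities under the generalized partial credit model.

    theta: examinee proficiency; a: item discrimination;
    b: step parameters (length K - 1 for K categories).
    """
    # Cumulative sums of a * (theta - b_k); the lowest category has sum 0
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, dtype=float)))))
    steps -= steps.max()  # subtract the maximum for numerical stability
    p = np.exp(steps)
    return p / p.sum()

for theta in (-2.0, 0.0, 2.0):
    print(theta, np.round(gpcm_probs(theta, a=2.0, b=(-1.5, 1.5)), 3))
```

A larger a makes the category probabilities change more sharply with θ, and step parameters that are spaced further apart widen the range of θ over which the middle category is most likely, which is the pattern described for the HRM-MO estimates above.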
6. Monte Carlo Simulation Study
6.1 Methods
Monte Carlo simulations were conducted to examine the sensitivity of the HRM-MO
model under varying sample sizes of 100, 400, and 1,000 for two exercises scored on two modes
of observation (i.e., live observation and videotaped observation) with three raters each,
following the same data structure as the Columbus examination. The sample sizes were designed
to reflect realistic numbers of examinees who take the promotion exam in the real-world
data. Although possible, it would be extremely rare for over 1,000 examinees to be tested
simultaneously in a national setting for the particular exam analyzed in Study 1.
Three conditions were used to generate data. Population values (generating values)
associated with these conditions are presented in Table 10. In condition 1, all raters are assumed
to have the same rater parameters, and item parameters are also the same; only the level 2
parameters (mode of observation level) differ. In condition 2, item parameters in level 3 are
different, in addition to different level 2 parameters. In condition 3, item, mode of observation,
and rater parameters all have different values. The motivation for these different conditions is to
examine parameter recovery at each level. Results from the real-world analysis in
Study 1 indicated that parameters from all three levels could vary. Given the three parameter
conditions presented in Table 10 and the three sample sizes, there were 9 total conditions
examined in the simulation study (9 total conditions = 3 parameter conditions in Table 10 x 3
sample size conditions).
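The level 1 step of such a data-generating process can be sketched in Python as follows. This is a simplified illustration, not the Stata code used in the study: it assumes a logistic latent class signal detection model in which P(Y ≤ k | η) = F(c_k – dη) for latent classes η = 0, 1, 2, and it draws the latent classes uniformly rather than from the level 2 and level 3 models. The function name and the seed are hypothetical.

```python
import numpy as np

def simulate_ratings(eta, c, d, rng):
    """Draw ratings 1..K for latent classes eta (coded 0..K-1) from one rater.

    Assumes a logistic LC-SDT rater model: P(Y <= k | eta) = logistic(c_k - d * eta).
    """
    eta = np.asarray(eta)
    c = np.asarray(c, dtype=float)
    # Cumulative probabilities, one row per examinee and one column per criterion
    cum = 1.0 / (1.0 + np.exp(-(c[None, :] - d * eta[:, None])))
    u = rng.random(eta.shape[0])
    # The rating is 1 plus the number of cumulative probabilities that u exceeds
    return 1 + (u[:, None] > cum).sum(axis=1)

rng = np.random.default_rng(2014)
eta = rng.integers(0, 3, size=400)  # simplification: classes drawn uniformly
# Condition 1 generating values for a rater in Table 10: c = (2.0, 6.0), d = 4.0
ratings = simulate_ratings(eta, c=(2.0, 6.0), d=4.0, rng=rng)
print(np.mean(ratings == eta + 1))  # share of ratings matching the latent class
```

In the generating values the criteria sit at d/2 and 3d/2, so raters with larger d separate the latent classes more sharply while keeping their criteria midway between adjacent classes.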
Table 10. Conditions for simulation: Generating values
Level / Exercise / Type                 Parameter   Condition 1   Condition 2   Condition 3
Level 3: CR item model (Generalized partial credit model)
1   Exercise 1                          b11         –1.5          –1.5          –1.5
                                        b12          1.5           1.5           1.5
                                        a1           2.0           2.0           2.0
2   Exercise 2                          b21         –1.5          –2.0          –2.0
                                        b22          1.5           2.0           2.0
                                        a2           2.0           3.0           3.0
Level 2: Mode of observation model (Latent class signal detection theory model)
1   Live (onsite)                       f111         1.5           1.5           1.5
                                        f112         4.5           4.5           4.5
                                        h11          3.0           3.0           3.0
    Video                               f121         2.5           2.5           2.5
                                        f122         7.5           7.5           7.5
                                        h12          5.0           5.0           5.0
2   Live (onsite)                       f211         2.5           2.5           2.5
                                        f212         7.5           7.5           7.5
                                        h21          5.0           5.0           5.0
    Video                               f221         2.5           2.5           2.5
                                        f222         7.5           7.5           7.5
                                        h22          5.0           5.0           5.0
Level 1: Rater model (Latent class signal detection theory model)
1   Rater 1: Live observation           c11          2.0           2.0           1.5
                                        c12          6.0           6.0           4.5
                                        d1           4.0           4.0           3.0
    Rater 2: Live observation           c21          2.0           2.0           2.0
                                        c22          6.0           6.0           6.0
                                        d2           4.0           4.0           4.0
    Rater 3: Live observation           c31          2.0           2.0           2.5
                                        c32          6.0           6.0           7.5
                                        d3           4.0           4.0           5.0
    Rater 4: Video                      c41          2.0           2.0           1.5
                                        c42          6.0           6.0           4.5
                                        d4           4.0           4.0           3.0
    Rater 5: Video                      c51          2.0           2.0           2.0
                                        c52          6.0           6.0           6.0
                                        d5           4.0           4.0           4.0
    Rater 6: Video                      c61          2.0           2.0           2.5
                                        c62          6.0           6.0           7.5
                                        d6           4.0           4.0           5.0
2   Rater 7: Live observation           c71          2.0           2.0           1.5
                                        c72          6.0           6.0           4.5
                                        d7           4.0           4.0           3.0
    Rater 8: Live observation           c81          2.0           2.0           2.0
                                        c82          6.0           6.0           6.0
                                        d8           4.0           4.0           4.0
    Rater 9: Live observation           c91          2.0           2.0           2.5
                                        c92          6.0           6.0           7.5
                                        d9           4.0           4.0           5.0
    Rater 10: Video                     c101         2.0           2.0           1.5
                                        c102         6.0           6.0           4.5
                                        d10          4.0           4.0           3.0
    Rater 11: Video                     c111         2.0           2.0           2.0
                                        c112         6.0           6.0           6.0
                                        d11          4.0           4.0           4.0
    Rater 12: Video                     c121         2.0           2.0           2.5
                                        c122         6.0           6.0           7.5
                                        d12          4.0           4.0           5.0
Note: Sample sizes of 100, 400, and 1,000 were used across the three conditions.
Data were generated using Stata 12 and fit in Latent Gold 4.5 with posterior mode
estimation and Bayes’ constants. One hundred replications of data were generated. Parameter
recovery was summarized using bias and mean squared error (MSE).
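The recovery summaries can be sketched in Python as follows. Bias and MSE follow their usual definitions across replications (mean deviation and mean squared deviation of the estimates from the generating value). The percent bias shown here, the absolute bias relative to the generating value, is an assumption, since its definition is not stated in the text; the function name and the example estimates are hypothetical.

```python
import numpy as np

def recovery_summary(estimates, true_value):
    """Return (bias, percent bias, MSE) for one parameter across replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value                   # mean estimate minus generating value
    pct_bias = 100.0 * abs(bias) / abs(true_value)   # assumed definition of % bias
    mse = ((est - true_value) ** 2).mean()           # mean squared deviation
    return bias, pct_bias, mse

# Hypothetical estimates of a parameter whose generating value is 2.0
bias, pct_bias, mse = recovery_summary([1.8, 2.0, 2.2, 2.4], true_value=2.0)
print(round(bias, 3), round(pct_bias, 1), round(mse, 3))
```

Because MSE combines squared bias and sampling variance, it can remain sizable when bias is near zero, as seen for several small-sample entries in the parameter recovery results.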
6.2 Results
Table 11 presents the parameter recovery by condition and sample size. Results were
averaged across parameters at each level and presented in terms of bias, % bias, and MSE. For
condition 1, bias was greatest for level 2 parameters, which differed for modes of observation for
exercise 1. However, for a sample size of 400, % bias was at most 5.3% for level 2 parameters.
In condition 2, data were generated to have different item parameters between the exercises; the
same mode of observation parameters were preserved from condition 1. Bias decreased to less
than 5% for a sample size of 400 for level 1 and 2 parameters. However, level 3 parameters still
had % bias of roughly 19% to 22%, even with a sample size of 1,000. Finally, in condition 3,
which allowed all parameters to vary, including level 1 rater parameters, the % bias results were
similar to condition 2. Even with a sample size of 1,000, the level 3 item parameters had % bias
of roughly 18% to 22%. MSE estimates decreased with larger sample sizes.
Tables 12 and 13 present latent class sizes by sample size and condition and classification
indices, respectively. Results from these simulation studies indicate that the recovery of latent
class sizes and classification of the HRM-MO are within range for real-world data applications
for measuring non-cognitive attributes. Overall, results from the simulation study indicate that
the greatest bias occurred from item parameters in level 3. Rater parameters (level 1) and modes
of observation parameters (level 2) were not largely biased even with differences in population
values.
Table 11. Parameter recovery by condition and sample size
Condition   Level      Parameters   n=100                     n=400                     n=1,000
                                    Bias    % Bias   MSE      Bias    % Bias   MSE      Bias    % Bias   MSE
1           3: Item    b            –.005   11.4%     .394     .009    3.8%    .131     –.008    1.8%    .060
                       a            –.127    6.8%     .312     .017     .9%    .070      .012     .6%    .029
            2: Mode    f             .057   11.8%    1.426    –.162    5.1%    .903     –.039    3.0%    .473
                       h            –.033    9.0%     .919     .117    5.3%    .550      .069    3.1%    .317
            1: Rater   c            –.122    3.2%     .469    –.021     .9%    .102     –.011     .7%    .041
                       d             .112    2.9%     .464     .027     .9%    .098      .009     .5%    .037
2           3: Item    b             .066   27.0%     .529     .020   17.2%    .276      .028   18.7%    .171
                       a            –.354   22.9%     .639    –.118   21.6%    .411     –.134   21.8%    .339
            2: Mode    f             .017   10.8%    1.415    –.127    4.1%    .836     –.120    4.2%    .458
                       h            –.071    8.5%     .820     .113    4.1%    .472      .094    3.5%    .264
            1: Rater   c            –.150    3.6%     .515    –.033    1.1%    .107     –.015     .7%    .039
                       d             .134    3.2%     .528     .027     .8%    .099      .011     .5%    .036
3           3: Item    b             .002   22.5%     .613    –.009   17.4%    .242     –.019   17.5%    .153
                       a            –.360   23.7%     .644    –.157   22.0%    .402     –.139   21.5%    .317
            2: Mode    f             .016   10.1%    1.299    –.150    3.4%    .734     –.097    2.2%    .405
                       h            –.031    7.9%     .815     .155    3.6%    .426      .118    2.7%    .262
            1: Rater   c            –.139    3.3%     .549    –.034    1.0%    .129     –.010     .6%    .048
                       d             .117    2.8%     .538     .036    1.0%    .121      .012     .5%    .047
Note: Bias, % Bias, and mean squared error (MSE) represent mean estimates for the parameters at each level. 100 replications were used in the simulation. Bias was defined as follows:

$$\mathrm{Bias}(x) = \frac{1}{N}\sum_{n=1}^{N}\left[e_n(x) - e(x)\right] = \frac{1}{N}\sum_{n=1}^{N} e_n(x) - e(x).$$

MSE was defined as follows:

$$\mathrm{MSE}(x) = \frac{1}{N}\sum_{n=1}^{N}\left[e_n(x) - e(x)\right]^{2}.$$
Table 12. Latent class sizes by condition and sample size
Condition   Latent     n=100                               n=400                               n=1,000
            Variable   Class 1     Class 2     Class 3     Class 1     Class 2     Class 3     Class 1     Class 2     Class 3
1           η11        .29 (.05)   .41 (.06)   .29 (.05)   .30 (.03)   .42 (.03)   .29 (.02)   .29 (.02)   .42 (.02)   .29 (.02)
            η12        .28 (.05)   .43 (.06)   .28 (.05)   .28 (.03)   .44 (.03)   .28 (.02)   .28 (.02)   .44 (.02)   .28 (.02)
            η21        .28 (.05)   .45 (.06)   .28 (.05)   .29 (.02)   .43 (.03)   .28 (.02)   .28 (.02)   .44 (.02)   .28 (.01)
            η22        .28 (.05)   .44 (.06)   .28 (.05)   .28 (.02)   .44 (.03)   .28 (.03)   .28 (.01)   .44 (.02)   .28 (.01)
            Φ1         .28 (.07)   .44 (.09)   .28 (.07)   .27 (.04)   .46 (.05)   .27 (.04)   .27 (.03)   .47 (.03)   .27 (.03)
            Φ2         .28 (.06)   .45 (.06)   .27 (.05)   .28 (.03)   .44 (.04)   .28 (.03)   .27 (.02)   .46 (.03)   .27 (.02)
2           η11        .29 (.05)   .41 (.06)   .30 (.05)   .29 (.03)   .41 (.03)   .29 (.03)   .29 (.01)   .42 (.02)   .29 (.02)
            η12        .28 (.05)   .43 (.05)   .29 (.05)   .28 (.02)   .44 (.02)   .28 (.02)   .28 (.01)   .44 (.02)   .28 (.02)
            η21        .28 (.05)   .43 (.05)   .29 (.05)   .28 (.02)   .43 (.03)   .29 (.03)   .29 (.02)   .43 (.02)   .29 (.01)
            η22        .28 (.04)   .43 (.06)   .29 (.05)   .28 (.03)   .43 (.03)   .29 (.02)   .28 (.02)   .43 (.02)   .29 (.01)
            Φ1         .28 (.06)   .42 (.07)   .30 (.07)   .27 (.04)   .45 (.05)   .28 (.04)   .26 (.03)   .47 (.03)   .27 (.02)
            Φ2         .28 (.04)   .42 (.06)   .30 (.06)   .28 (.03)   .44 (.04)   .28 (.03)   .27 (.02)   .45 (.03)   .28 (.02)
3           η11        .29 (.05)   .42 (.06)   .29 (.05)   .29 (.03)   .42 (.03)   .29 (.03)   .29 (.02)   .42 (.02)   .29 (.01)
            η12        .28 (.05)   .44 (.06)   .28 (.05)   .28 (.03)   .44 (.03)   .28 (.03)   .28 (.01)   .44 (.02)   .28 (.02)
            η21        .29 (.05)   .43 (.05)   .28 (.05)   .29 (.02)   .43 (.03)   .29 (.03)   .29 (.02)   .43 (.02)   .29 (.02)
            η22        .29 (.05)   .43 (.06)   .28 (.05)   .29 (.02)   .43 (.03)   .28 (.02)   .29 (.01)   .43 (.02)   .28 (.01)
            Φ1         .28 (.06)   .43 (.09)   .29 (.06)   .27 (.04)   .45 (.05)   .27 (.04)   .27 (.03)   .46 (.03)   .27 (.03)
            Φ2         .29 (.05)   .43 (.06)   .28 (.06)   .28 (.03)   .44 (.04)   .28 (.03)   .28 (.02)   .45 (.02)   .27 (.02)
Note: Values in parentheses represent standard errors.
Table 13. Classification indices by conditions and sample size
Condition   Sample size   Classification
Rater Mode of observation Exercise 1 Exercise 2
                                           Direct   Video   Direct   Video   Direct   Video
1           100           Pc               .956     .954    .961     .963    .878     .918
                          λ                .923     .918    .928     .932    .774     .848
            400           Pc               .950     .952    .954     .954    .877     .916
                          λ                .914     .913    .920     .918    .772     .848
            1000          Pc               .948     .949    .952     .952    .877     .911
                          λ                .909     .910    .915     .915    .770     .837
2           100           Pc               .954     .957    .963     .961    .876     .919
                          λ                .921     .924    .935     .931    .778     .857
            400           Pc               .949     .952    .956     .956    .884     .917
                          λ                .913     .914    .922     .922    .787     .850
            1000          Pc               .948     .949    .953     .953    .881     .916
                          λ                .911     .909    .918     .918    .776     .847
3           100           Pc               .956     .958    .966     .963    .880     .921
                          λ                .923     .923    .939     .935    .782     .859
            400           Pc               .952     .954    .959     .960    .887     .920
                          λ                .917     .917    .929     .930    .792     .856
            1000          Pc               .951     .952    .958     .959    .887     .916
                          λ                .915     .914    .927     .927    .789     .849
Note: Proportion correctly classified (Pc) and λ are both measures of classification based on posterior probability (Clogg, 1995). The λ statistic accounts for classification that can occur by chance. Values in parentheses represent standard errors.
7. Conclusion
This paper reviews psychometric rater models used in the measurement literature to
refine measures of non-cognitive attributes. While the use of non-cognitive attributes provides
new approaches to target interventions that can impact human capital, measurement issues have
yet to be resolved. This paper contributes to the literature in this regard by proposing a solution
to generate model-based scores that can provide more refined estimates. To demonstrate this
application, psychometric rater models used in the educational measurement and mathematical
psychology literature are presented. In addition, a new model that extends the existing foundation
of LC-SDT is proposed.
The analyses conducted in this paper show the utility of applying these techniques. First,
the CPS teacher evaluation data were fit using the LC-SDT model. Results showed that model-
based scores that account for rater effects generated larger effect sizes with value-added scores.
Although modest, the difference when compared to traditional techniques that use linear
regression can be quite large when taken into context with other value-added results shown in the
literature (Bill and Melinda Gates Foundation, 2012). Moreover, using latent class regression
that incorporates a psychometric rater model may yield more refined results when compared to
traditional value-added models; further investigation is needed.
This study also contributes by proposing a new method that accounts for mode of
observation. Many non-cognitive attributes can be directly observed or measured through post-
hoc mechanisms such as video playback. Findings from the real-world data analysis show utility
in this approach. The Monte Carlo simulation results also show promise in the continued
development of these techniques as more refined methods to capture learners’ non-cognitive
attributes.
Recently, there has been an increase in observation-based methods to assess candidates,
as scoring can be based on live or video-based observations; such testing practice is
administered frequently in medical education and in other professions. In the K–12 education
literature, measuring effective teaching has been conducted onsite by observers or offsite using
video recordings. Given the increase in observations to measure performance, a measurement
model that accounts for modes of observation is necessary.
The HRM-MO model proposed in this study provides a framework for extending the
HRM, which previously only accounted for raters at level 1 and the items at level 2. The
HRM-MO accounts for a separate level between the rater and item levels that models the effect of
observation mode. This can be a useful approach for researchers as multiple modes of
observation can be applied in high-stakes testing. Quality of observation mode can provide
information for planning the scoring design. In addition, this study contributes to the growing
literature on developments of HRM, which can lead to improved measurements of examinee
performance.
The HRM-MO used in this study can be a useful model for studying modes of
observation. It explained differences between modes of observation in greater detail than
the simple rater agreement statistics or the traditional HRM-SDT. The model fit indices based on
HRM-MO also showed improved fit, which could be a promising indication for further
development of this model. Simulation results also showed that the higher-level item parameters
in level 3 were recovered with the greatest bias (% bias of roughly 17% to 27% in conditions 2
and 3). Conditions that yield better estimation of item parameters should be examined as part of
future research.
As greater emphasis is placed on investing in non-cognitive attributes of learners at
various stages of training, additional care should be applied in their measurement. Although much
work in the professions education and educational measurement literature has contributed to this
effort, translation of these techniques to further reduce gaps between disciplines may be needed.
While measurement sciences focus on improving the precision around constructs, a method to
align these methodological trends with long-term outcomes would support better estimation and
deeper understanding of how non-cognitive attributes influence human development and
potential.
References
Abramson, David, Yoon Soo Park, Tasha Stehling-Ariza, and Irwin Redlener. 2010. “Children as
bellwethers of recovery: Dysfunctional systems and the effects of parents, households,
and neighborhoods on serious emotional disturbance in children after Hurricane Katrina.”
Disaster Medicine and Public Health Preparedness 4:S17–S27.
Agresti, Alan. 2002. Categorical data analysis. Hoboken: Wiley.
Almlund, Mathilde, Angela Duckworth, James Heckman, and Tim Kautz. 2011. “Personality
psychology and economics.” In Handbook of the Economics of Education, edited by Eric
Hanushek, Stephen Machin, and Ludger Wößmann. Amsterdam: Elsevier.
Beard, J. D., B. C. Jolly, D. I. Newble, W. E. Thomas, J. Donnelly, and L. J. Southgate. 2005.
“Assessing the technical skills of surgical trainees.” British Journal of Surgery 92 (6):
778–82.
Bill and Melinda Gates Foundation, Measures of Effective Teaching (MET). 2012. Gathering
feedback for teaching: Combining high–quality observations with student surveys and
achievement gains. Seattle: Bill and Melinda Gates Foundation.
Cardy, Robert L., and Gregory Dobbins. 1986. “Affect and appraisal accuracy: Liking as an
integral dimension in evaluating performance.” Journal of Applied Psychology 71:672–
678.
City of Columbus Civil Service Commission. 2012. 2012 police lieutenant and commander
promotional examination: Test guide. Columbus: City of Columbus Civil Service
Commission.
Clogg, Clifford, and Wendy D. Manning. 1996. “Assessing reliability of categorical
measurements using latent class models.” In Categorical variables in developmental
research, edited by Alexander von Eye and Clifford C. Clogg. New York: Academic
Press.
Cohen, Jacob. 1960. “Coefficient of agreement for nominal scales.” Educational and
Psychological Measurement 20:37–46.
Cohen, Jacob. 1968. “Weighted kappa: Nominal scale agreement with provision for scaled
disagreement or partial credit.” Psychological Bulletin 70: 213–220.
Danielson, Charlotte. 2007. Enhancing professional practice: A framework for teaching.
Alexandria, VA: Association for Supervision and Curriculum Development.
Dayton, Chauncey Mitchell. 1998. Latent class scaling models. Thousand Oaks: Sage.
DeCarlo, Lawrence T. 2002. “A latent class extension of signal detection theory, with
applications.” Multivariate Behavioral Research 36:423–451.
DeCarlo, Lawrence T., Youngkoung Kim, and Matthew S. Johnson. 2011. “A hierarchical
rater model for constructed responses, with a signal detection rater model.” Journal of
Educational Measurement 48:333–356.
Ezra, Daniel, Raj Aggarwal, Michel Michaelides, Narciss Okhravi, Seema Verma, Larry
Benjamin, Philip Blom, Ara Darzi, and Paul Sullivan. 2009. “Skills acquisition and
assessment after a microsurgical skills course for ophthalmology residents.”
Ophthalmology 116:257–262.
Groot, Wim. 2000. “Adaptation and scale of reference bias in self-assessments of quality of life.”
Journal of Health Economics 19 (3):403–420.
Gutman, Leslie, and Ingrid Schoon. 2013. The impact of non-cognitive skills on outcomes for
young people. London: Institute of Education.
Heckman, James, and Tim Kautz. 2013. “Fostering and measuring skills: Interventions that
improve character and cognition.” Working Paper no. 2013-019, HCEO, Chicago, IL.
Heckman, James, Rodrigo Pinto, and Peter Savelyev. 2013. “Understanding the mechanisms
through which an influential early childhood program boosted adult outcomes.” American
Economic Review 103 (6):1–35.
Heckman, James, and Yona Rubinstein. 2001. “The importance of noncognitive skills: Lessons
from the GED Testing Program.” American Economic Review 91 (2):145–9.
Hely, M. A., T. Chey, A. Wilson, P. M. Williamson, D. J. O’Sullivan, D. Rail, and J. G. Morris. 1993.
“Reliability of the Columbia scale for assessing signs of Parkinson’s disease.” Movement
Disorders 8:466–472.
Jackson, C. Kirabo. 2013. “Non-cognitive ability, test scores, and teacher quality: Evidence from
9th grade teachers in North Carolina.” Working Paper no. 18624, NBER, Cambridge, MA.
Jacob, Brian A., and Lars Lefgren. 2008. “Can principals identify effective teachers? Evidence
on subjective performance evaluation in education.” Journal of Labor Economics 26 (1):
101–36.
Makoul, Gregory, and Raymond Curry. 2007. “The value of assessing and addressing
communication skills.” Journal of the American Medical Association 298:1057–9.
Mariano, Louis T. 2002. “Information accumulation, model selection and rater behavior in
constructed response student assessments.” Manuscript, Carnegie Mellon University.
McLachlan, Geoffrey, and Thriyambakam Krishnan. 2008. The EM algorithm and extensions.
San Francisco: Wiley.
Muraki, Eiji. 1992. “A generalized partial credit model: Application of an EM algorithm.”
Applied Psychological Measurement 16:159–176.
Olshfski, Dorothy, and Robert Cunningham. 1985. “Improving management effectiveness by
training – The use of video techniques to assist self-appraisal.” Technovation 3:235–242.
Park, Yoon Soo, Steven Holtzman, and Jing Chen. 2014. “Evaluating efforts to minimize rater
bias in scoring classroom observations.” In Designing Teacher Evaluation Systems: New
Guidance from the Measures of Effective Teaching Project, edited by Thomas Kane,
Kerri Kerr, and Robert Pianta. San Francisco: Wiley.
Park, Yoon Soo, and Young-Sun Lee. 2014. “An extension of the DINA model using covariates:
Examining factors affecting response probability and latent classification.” Applied
Psychological Measurement 38 (5): 376–90.
Park, Yoon Soo, Janet Riddle, and Ara Tekian. 2014. “Validity evidence of resident competency
ratings and the identification of problem residents.” Medical Education 48 (6):614–22.
Patz, R. J. 1996. “Markov chain Monte Carlo methods for item response theory models with
applications for the National Assessment of Educational Progress.” Manuscript, Carnegie
Mellon University.
Patz, Richard, Brian Junker, Matthew Johnson, and Louis Mariano. 2002. “The hierarchical rater
model for rated test items and its application to large-scale educational assessment data.”
Journal of Educational and Behavioral Statistics 27:341–384.
Pratt, Travis C., and Francis T. Cullen. 2000. “The empirical status of Gottfredson and Hirschi’s
general theory of crime: A meta-analysis.” Criminology 38 (3): 931–64.
Rockoff, Jonah E., and Cecilia Speroni. 2010. “Subjective and objective evaluations of teacher
effectiveness.” American Economic Review 100:261–6.
Schaeffer, Gary, Jacqueline Briel, and Mary Fowles. 2001. “Psychometric evaluation of the new
GRE writing assessment.” ETS Research Report No. RR-01-18, Princeton, NJ.
Tate, Richard. 1999. “A cautionary note on IRT-based linking of tests with polytomous items.”
Journal of Educational Measurement 36:336–46.
Taylor, Shelley, and Susan Fiske. 1978. “Salience, attention, and attributions: Top of the head
phenomena.” In Advances in experimental social psychology, edited by L. Berkowitz.
New York: Academic Press.
van der Vleuten, C. P., and David Swanson. 1990. “Assessment of clinical skills with
standardized patients: State of the art.” Teaching and Learning in Medicine 2:58–76.
Vassiliou, Melina, Liane Feldman, Shannon Fraser, Patrick Charlebois, Prosanto Chaudhury,
Donna Stanbridge, and Gerald Fried. 2007. “Evaluating intraoperative laparoscopic skill:
Direct observation versus blinded videotaped performances.” Surgical Innovation 14
(3):211–6.
Vermunt, Jeroen, and Jay Magidson. 2005. Technical guide for Latent Gold 4.0: Basic and
advanced. Belmont: Statistical Innovations, Inc.
Vincent, Charles, Magi Young, and Angela Phillips. 1994. “Why do people sue doctors? A study
of patients and relatives taking legal action.” Lancet 343:1609–13.
Vivekananda-Schmidt, Pirashanthie, Martyn Lewis, David Coady, Catherine Morley, Lesley
Kay, David Walker, and Andrew Hassell. 2007. “Exploring the use of videotaped
objective structured clinical examination in the assessment of joint examination skills of
medical students.” Arthritis & Rheumatism 57:869–76.