Measuring Sixth-Grade Students’ Problem Solving: Validating an
Instrument Addressing the Mathematics Common Core
Jonathan David Bostic Bowling Green State University
Toni A. Sondergeld Bowling Green State University
This article describes the development of a problem-solving instrument intended for classroom use that addresses the Common Core State Standards for Mathematics. In this study, 137 students completed the assessment, and their
responses were analyzed. Evidence for validity was collected and examined using the current standards for educational
and psychological testing. Instrument validation findings regarding internal consistency reliability were high, and
multiple forms of validity (i.e., content, response processes, internal structure, relationship to other variables, and
consequences of testing) were found to be appropriate. The resulting instrument provides teachers and researchers with
a sound tool to gather data about sixth-grade students’ problem solving in the Common Core era.
Problem solving has been a notable theme within mathematics education (National Council of Teachers of Mathematics [NCTM], 1989, 2000, 2009), and its importance is clearly seen in the Common Core State Standards for Mathematics (CCSSM; National Governors Association and Council of Chief State School Officers [NGA & CCSSO], 2010). A central feature of the CCSSM is a keen focus on mathematical problem solving, which is highlighted as its own Standard for Mathematical Practice (SMP) but also woven throughout several Standards for Mathematics Content (SMCs; Kanold & Larson, 2012). New standards mean that old measures of student learning need revision and revalidation, or new measures must be created and validated to ensure alignment of classroom curriculum and instruction with assessment. The purpose of this study was to pilot and validate a new measure of sixth-grade students' problem-solving abilities addressing CCSSM content and discuss its potential for future use.
Related Literature
Problems and Problem-Solving Framework
Problems are characterized as tasks that meet the
following criteria: (a) It is unknown whether a solution
exists, (b) a solution pathway is not readily determined,
and (c) there exists more than one way to answer the task
(Schoenfeld, 2011). Problems are distinct from exercises (Kilpatrick, Swafford, & Findell, 2001), and problem solving goes beyond the type of thinking needed to solve exercises (Mayer & Wittrock, 2006; Polya, 1945/2004).
Lesh and Zawojewski (2007) characterize problem solving
as involving “several iterative cycles of expressing, testing
and revising mathematical interpretations—and of sorting
out, integrating, modifying, revising, or refining clusters
of mathematical concepts from various topics within and
beyond mathematics” (p. 782). Many, including CCSSM
authors, have suggested that students ought to experience
developmentally appropriate tasks that are open, realistic,
and complex (Boaler & Staples, 2008; Bostic, Pape, &
Jacobbe, in press; Palm, 2006; Verschaffel et al., 1999). These sorts of tasks are often found in outside-of-school
contexts (Boaler & Staples, 2008; Bostic et al., in press)
and they provide opportunities for students to demonstrate
critical thinking (Bostic, 2015; Lesh & Zawojewski, 2007;
Matney, Jackson, & Bostic, 2013). “Open” tasks can be
solved in different ways and offer learners multiple entry
points while problem solving. “Realistic” tasks draw upon
a problem solver’s experiential knowledge and engage the
student in a task that might occur in the real world.
“Complex” tasks require an individual to persevere and employ sustained reasoning to solve them. Such open, realistic, and complex tasks offer opportunities for students to
exhibit mathematical behaviors and habits described in the
SMPs that connote problem solving (NGA & CCSSO,
2010; see Table 1). The SMPs are connected to similar
mathematics behaviors and habits described in character-
izations of mathematical proficiency (Kilpatrick et al.,
2001) as well as the NCTM’s process standards (NCTM,
2000). Thus the SMPs are not necessarily new ideas;
instead, they are clearly “linked to mathematical goals
articulated in previous documents and by other groups”
(Koestler, Felton, Bieda, & Otten, 2013, p. v).
With these mathematical behaviors and habits in mind, it is necessary to create measures that assess students' math-
ematics content knowledge through open, complex, and
realistic tasks addressing the CCSSM content and practice
standards.
Measures of Problem-Solving Ability
A review of the literature through multiple scholarly
search engines (e.g., EBSCO, Google Scholar, and
Science Direct) demonstrated that content standards at the
state and national levels have not been assessed in current
problem-solving measures. Table 2 provides a list of
merely four measures found in the literature that address
mathematical problem solving (not mathematical achieve-
ment). All were discussed in peer-reviewed journals, and
evidence for validity of the measures was shared. Mea-
sures published in journals and books without peer review and/or measures without evidence for validity were not
considered in our review.
Previous problem-solving measures for middle-school
students can be described as using one of two types of
problem-solving measures. The first set includes analysis
of large-scale data sets such as the Programme for Inter-
national Student Assessment and National Assessment
of Educational Progress (Organization for Economic
Development, 2010; National Center for Education
Statistics, 2009). The second set of studies draws on
locally constructed measures (e.g., Charles & Lester,
1984; Verschaffel et al., 1999). Taken collectively, these studies lay a foundation for examining middle-grade stu-
dents’ problem-solving ability. They also suggest a need
for measures that support assessment of students’ math-
ematical problem solving such that the mathematics
covered in the items addresses the mathematics content
that students are expected to learn in their mathematics
classes (i.e., standards-aligned assessments). Our study
aims to validate a new measure of problem solving that
will work toward meeting this need.
Belgian word-problem tests. The initial design of the
test described in this manuscript stems from two previous
problem-solving word-problem (WP) tests constructed to be parallel in nature (different items with the same
content) for use with Belgian fifth-grade students
(Verschaffel et al., 1999). The goal of their investigation
was to explore the impact of supplementing typical math-
ematics instruction with problem-solving instruction,
specifically researching students’ problem-solving perfor-
mance. They created two parallel measures (WP pretest
and WP posttest) composed of 10 open, realistic, and
complex WPs. An item from the WP pretest states, “Martha
is reading a book. Suddenly she finds out that some pages
are missing because page 135 is immediately followed by
page 173. How many pages are missing?” (Verschaffel et al., 1999, p. 214). Each item on one test had a similar
but not identical task on the other test, which was assumed
to be parallel in content and difficulty. Verschaffel et al.'s research provided two problem-solving instruments (WP pretest and posttest) that most closely met the intent of our work with similarly aged students. Hence their work grounded the construction of our measure.
In a validation study, a total of 232 Belgian fifth-grade
students completed Verschaffel et al.'s (1999) problem-solving measures. Internal consistency results suggested moderate levels of reliability for their measures, Cronbach's α = .56 (pretest) and .75 (posttest). Items and
measures were deemed mathematically correct and devel-
opmentally appropriate by an expert panel consisting of
mathematicians and mathematics educators. Furthermore,
the panel agreed that items were open, complex, and real-
istic to students completing the measures. Results indicated that students averaged 1.6 correct responses on the 10-item measures. This
Table 1
Standards for Mathematical Practice

SMP # | Title
1 | Make sense of problems and persevere in solving them.
2 | Reason abstractly and quantitatively.
3 | Construct viable arguments and critique the reasoning of others.
4 | Model with mathematics.
5 | Use appropriate tools strategically.
6 | Attend to precision.
7 | Look for and make use of structure.
8 | Look for and express regularity in repeated reasoning.
Table 2
Characteristics of Measures Developed to Assess Middle-School Students' Problem Solving

Measure name | Author (Year) | Format | Age/grade level | Aligned with state or national standards
Programme for International Student Assessment (PISA) | Organization for Economic Development (2010) | Multiple choice and constructed response | 15–16 years old | None indicated
National Assessment of Educational Progress (NAEP) | National Center for Education Statistics (2009) | Multiple choice and constructed response | 8th grade | None indicated
No name | Verschaffel et al. (1999) | Constructed response | 5th grade | None indicated
No name | Charles and Lester (1984) | Constructed response | 5th and 7th grades | None indicated
result supports the claim that problem solving is in fact difficult for students to master.
These measures were used to create an initial problem-
solving instrument for use in the United States (Bostic,
Pape, & Jacobbe, 2011). To do this, a multistep process was
completed. First, an individual who previously taught
Dutch at the university level translated the instruments into English. Second, items consisting of only one sentence
were not used in the English version because they were
significantly shorter than most tasks. For instance, the
readability score using Flesch–Kincaid analysis (Kincaid,
Fishburne, Rogers, & Chissom, 1975) was much lower on
items consisting of one sentence compared with others.
Third, problems were revised to update contexts, to reflect
U.S. students’ experiences, and to clarify the language.
Finally, three rounds of pilot testing, collection of evidence
for validity, analysis of psychometric properties, and revi-
sion supported creating the measure that is the focus of this
paper.
Validity and Reliability of Tests
To be considered a sound measure, tests should provide
multiple pieces of evidence for validity as well as reliability
(American Educational Research Association [AERA],
American Psychological Association, & National Council
on Measurement in Education, 2014; Gall, Gall, & Borg,
2007). Sufficient validity evidence is needed to determine
the degree to which interpretations of test scores are sup-
ported by use of the tests (AERA et al., 2014; Gall et al.,
2007). Greater validity evidence leads to stronger confi-
dence in the interpretations of score reports. There are numerous types of validity evidence discussed in research
literature; the “five main types of evidence for demonstrat-
ing the validity of test-score interpretations [are] evidence
from: test content, response processes, internal structure,
relationship to other variables, and consequences of
testing” (Gall et al., 2007, p. 195). Test content validity
evidence indicates the degree to which content (or items)
found on the measure addresses the construct of interest.
This is a judgment call typically determined by a panel of
content experts. Response processes validity evidence suggests the degree to which the processes engaged in by the respondent when completing the test are consistent with a known construct (Gall et al., 2007). One technique for
gathering response process evidence is to conduct cognitive
interviews or think-aloud interviews with participants rep-
resentative of those who would be expected to complete the
assessment. Internal structure evidence describes how one
item relates to others. Evaluating how the items work together as a construct through traditional methods (i.e., factor analysis) or modern measurement (i.e., Rasch analysis) is a typical means of gathering evidence in this area.
Validity evidence for relationship to other variables may
take on various forms. Test makers might seek convergent
or divergent evidence that examines how test scores corre-
late with scores on other measures (i.e., similarly and
differently, respectively). Another approach is to examine
how score distributions compare for two groups hypothesized to be similar or different. Finally, consequences of testing validity describes how the values implied by a test and the consequences of taking it impact respondents.
Interviews with respondents following measure adminis-
tration may add evidence indicating to what degree test
taking influenced respondents affectively and cognitively.
Validity evidence is necessary but not sufficient for a
high-quality test (AERA et al., 2014); thus test construc-
tion must also include an examination of internal consis-
tency (i.e., reliability). Internal consistency reliability
estimates the “coefficient of precision from a set of
real test scores” (Crocker & Algina, 2006, p. 117). There are multiple internal consistency measures such as Cronbach's α, Rasch reliability, and Raykov's approach.
There is no best method for assessing internal consistency
because each has strengths and limitations. Historically,
classical testing theory approaches were used to examine
psychometric properties of measures (Crocker & Algina,
2006). However, in the last 30 years, an alternate psycho-
metric analysis framework arose called item response
theory (IRT). IRT has become a popular approach to inves-
tigate psychometric properties of tests because of its
advantages over classical testing theory in multiple ways. We discuss one type of IRT analysis, Rasch modeling
(Rasch, 1960/1980), and some differences between it and
classical testing theory in the next section as a way of
examining item difficulty and item discrimination as well
as the properties of the test as a whole.
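As a concrete aside before turning to Rasch modeling, here is a minimal sketch of Cronbach's α, one of the internal consistency statistics named above. The study itself used SPSS and Winsteps, so this NumPy-based function and its toy data are our illustration only, not code from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    For dichotomous (0/1) items, as on the PSM6, alpha
    reduces to KR-20.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 4 persons x 3 dichotomous items (illustrative only)
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
print(round(cronbach_alpha(responses), 2))  # 0.75
```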
Item Response Theory: Rasch Modeling
Rasch modeling, often referred to as one-parameter IRT,
has four key statistical assumptions: (a) Ability is a unidi-
mensional trait, (b) items are locally independent, (c) the
probability of correctly answering items increases as ability
increases, and (d) item parameters are independent of
respondents' abilities (De Ayala, 2009; Embretson & Reise, 2000).
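In symbols, the standard dichotomous Rasch model (Rasch, 1960/1980) gives the probability that person n answers item i correctly as a function of the difference between person ability θ_n and item difficulty δ_i:

$$P(X_{ni} = 1) = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}$$

When θ_n = δ_i the probability is .50, an interpretation that the variable map presented later relies on.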
Rasch modeling has multiple benefits over classical testing theory, or CTT (De Ayala, 2009; Embretson & Reise, 2000). First, Rasch methods lead to
results that offer trait-level estimates of an individual’s
ability that depend on an individual’s responses and item
properties (Embretson & Reise, 2000). That is, Rasch mod-
eling allows individuals to be measured against the con-
struct (i.e., criteria or items) rather than a norm-referenced
sample, which elicits a criterion-referenced interpretation
of results. Thus, multiple populations can be compared with one another, which cannot necessarily be done with CTT because CTT results are sample dependent.
A second benefit of using Rasch methods is that mea-
sures using a Rasch framework are likely to offer more
accurate estimates of problem-solvers' abilities (De Ayala, 2009; Embretson & Reise, 2000). For example, let us
suppose that two students, Student A and Student B, both
correctly answer 6 items on a 10-item assessment. Student
A correctly answers the six easiest items, whereas Student B
correctly answers the six most difficult items. With CTT,
both students earn a 60% regardless of item difficulty. If
using Rasch methods, due to the conjoint measurement
model placing items and people on the same ruler, then
Student B earns a higher score than Student A because
item difficulty is taken into consideration.
Third, Rasch modeling approaches require fewer total
items to accurately measure someone’s ability compared with CTT approaches (De Ayala, 2009), and missing data
are not problematic when using Rasch modeling because
of the probabilistic nature of the conjoint item/person
measure (Bond & Fox, 2007). In practice, this means that
a test taker can fail to complete the test or skip items they
are unsure of, and Rasch methods still have the capacity to estimate an accurate ability measure for that person based on the data that were collected, by relating the correct item responses to their item difficulty.
Finally, each test completer has his/her own standard
error estimate rather than assuming one for all respondents (Embretson & Reise, 2000), which elicits more precise
ability measures. In summary, using Rasch methods for
measure creation and refinement is considered one of
the best approaches by many social science researchers
because of its ability to convert ordinal data into conjoint,
hierarchical, equal-interval measures that place both
person abilities and item difficulties on the same scale so
they can be directly compared with each other (see Bond
& Fox, 2007).
With so many advantages of Rasch modeling over CTT noted in terms of assessment construction and validation (e.g., Smith, 1996; Waugh & Chapman, 2005; Wright, 1996), one might wonder why modern measurement techniques still fail to dominate the field of test construction in comparison with CTT.
Two main reasons have been cited for this discrepancy
(Smith, Conrad, Chang, & Piazza, 2002). First, CTT
assumptions are considered “weak” and easily met by test
developers, whereas Rasch specifications are stricter and
at times render data unusable if they do not fit the model.
Second, although traditional CTT statistics are taught in
introductory courses to most graduate students, Rasch (or
IRT) measurement methods are only taught in more
advanced courses and to far fewer individuals. In fact,
many universities do not offer such courses to their gradu-
ate students at all. With the advantages of Rasch modeling
in mind, along with the need to revise or develop new CCSSM-aligned assessments, it is an ideal time to employ
more advanced measurement methods to advance research
on middle-grades students’ problem solving.
Objectives of This Study
In this study, we examine the psychometric qualities
(i.e., validity and reliability evidence discussed earlier) of
a new measure of student mathematics problem-solving
ability focusing on grade six SMCs. The measure is called
the Grade 6 Problem-Solving Measure (PSM6). Our
overarching research question is: What are the psychomet-
ric properties for the PSM6? We share evidence for the five
main types of validity (test content, response processes, internal structure, relationship to other variables, and con-
sequences of testing), as well as test bias, construct valid-
ity, internal consistency reliability, unidimensionality, item
difficulty to student ability targeting, and item function.
Method
Data Sources, Collection, and Procedures
Grade 6 problem-solving measure. The PSM6 has
evolved over the course of previous investigations. One
previous form and its related psychometric properties (i.e.,
internal consistency and item characteristics) are discussed in Bostic et al. (2011). Several revisions have taken place
since that publication. First, more items drawing on the same latent trait and content area (e.g., geometry; statistics and probability; expressions and equations) were added to produce better estimates of sixth-grade students' problem-solving abilities. Previous versions omitted the geometry domain, whereas the current version contains three geometry items, aligning it better with the CCSSM. A second issue was creating easier and/or more difficult items for each domain. For instance, previous versions had statistics and probability items that were correctly solved by a large percentage of respondents and considered easier items. There were no difficult statistics and probability items to balance the easier ones; hence, revisions were needed to rectify this issue. An item from the PSM6 is shown in Figure 1.
When distributed to students, one item was displayed on
each page and figures were large in size for ease in read-
ability. Item descriptions and their associated SMCs are
presented in Table 3.
Grade 6 students from a suburban school district in a Midwest state were contacted during the last nine-week
period of the academic year about their willingness to
complete the PSM6. A total of 137 students volunteered
to complete it during one administration. None of the
respondents was an English-language learner or receiving
services for a disability or giftedness. Test administration
took approximately 75 minutes.
Qualitative data. Various sources of qualitative data
were collected to inform the findings of this study. The first
set of data came from a content expert panel consisting of
one mathematician holding a Ph.D., two university-level
mathematics educators holding terminal degrees, and two sixth-grade teachers. Panel members shared that they were familiar with the SMCs and SMPs. They were asked to
examine the PSM6 and consider the following questions
for each item.
• Does the item address one or more sixth-grade
SMCs? If so, which one(s)?
• Does the item provide a context to engage students in
the SMPs?
• Is there more than one developmentally appropriate (i.e., grade level or lower) way to solve the problem?
• Is the item complex enough to be considered a
problem?
• Does the item draw on realistic contexts that students
might recognize?
• Do you perceive bias in the item toward any group of
individuals?
The mathematician also responded to two additional
prompts.
• Does the item have a well-defined solution set?
• Is the mathematics in the item accurate and presented
clearly?
A second set of qualitative data stemmed from cognitive
interviews with 10 students. The purpose of these inter-
views was twofold. One purpose was to assess whether
students could read and comprehend the words in the
items so they might understand the situations embedded
within them. The second purpose was to assess students’
feelings about solving these problems and more generally,
mathematical problem solving. There were five boys and
A group of 150 tourists were waiting for a shuttle to take them from a parking lot to a theme
park's entrance. The only way they could reach the park's entrance was by taking this shuttle. The shuttle can carry 18 tourists at a time. After one hour, everyone in the group of 150 tourists
reached the theme park’s entrance. What is the fewest number of times that the shuttle picked
tourists up from the parking lot?
Figure 1. Sample PSM6 item. Item description is “Tourist shuttle.” Connections to SMCs are found in Table 3.
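For readers who want the arithmetic, one solution path (our illustration; the published item does not include an answer key) treats the fewest pickups as the smallest whole number of 18-tourist trips covering all 150 tourists:

$$\left\lceil \frac{150}{18} \right\rceil = \lceil 8.3\overline{3} \rceil = 9 \text{ pickups.}$$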
Table 3
Connections to Standards for Mathematics Content

Question # | Description | Revised or added | Primary SMC* | Secondary SMC
1 | Ice cream | No revision | 6.SP.1 | –
2 | Lightning bolt | Added | 6.G.1 | –
3 | Wooden gate | No revision | 6.NS.3 | 6.NS.1
4 | Tourist shuttle | No revision | 6.RP.3 | 6.EE.7
5 | Water park | No revision | 6.NS.3 | –
6 | Silly bandz | No revision | 6.EE.7 | –
7 | Sam's box | Added | 6.G.2 | –
8 | Sandhill lunch | No revision | 6.SP.1 | –
9 | Bicycle | No revision | 6.EE.2 | –
10 | Jerome's paint | Revised | 6.RP.3 | –
11 | Youth group | No revision | 6.NS.3 | –
12 | Animal day care | No revision | 6.RP.3 | 6.EE.7
13 | Julie's fish | Added | 6.SP.5 | –
14 | Pyramid | Added | 6.G.4 | 6.G.1
15 | Glass bottom boat | No revision | 6.RP.3 | 6.EE.7

Note. *Descriptions of each CCSSM domain abbreviation are EE = expressions and equations; G = geometry; NS = number sense; RP = ratio and proportion; and SP = statistics and probability.
five girls and each one characterized himself or herself as
African American, Caucasian (not Hispanic or Latino/a),
Hispanic, or Caucasian (Hispanic or Latino/a). Mathemat-
ics teachers assisted with sample selection diversity by
suggesting students who typically performed below
average, on average, and above average in comparison
with their peers. Thus, at least one boy and one girl were representative of each ability level.
Students were presented with an introductory task to
prime them for thinking aloud. Next, they were given one
task at a time. An interviewer asked students to solve
problems and voice their thinking during problem solving.
After solving the problem, the interviewer asked the
student whether he/she might show another way to solve
the same problem. When the student could not share a
different way to solve the problem, the interviewer pro-
ceeded to the next item. The interviewer did not answer
students’ questions but instead encouraged them to reread
the task and think about what it asked. At the end of the interview, students were asked to describe how they felt
about the items and the process of solving these problems.
Data Analysis
Quantitative and qualitative data analyses were both
employed in this study depending on the data type. Quali-
tatively, inductive analysis (Hatch, 2002) was conducted to
assess content validity evidence (expert review), test bias
(expert review and student cognitive interviews), and con-
sequences of test-taking validity evidence (student cogni-
tive interviews). Interview data were transcribed and
salient themes that emerged across domains were identified using inductive analysis as well. In sum, inductive analysis
is the process of identifying salient themes from data sets
(Glaser & Strauss, 1967/2012; Hatch, 2002; Strauss &
Corbin, 1998). Our implementation of Hatch’s (2002)
approach is shared to understand how we arrived at our
findings. First, we familiarized ourselves with the qualita-
tive data by rereading all materials (e.g., expert panel
reviews, transcribed cognitive interviews, etc.). Next,
initial ideas stemming from this rereading were recorded
as memos to consider during later analyses. Then, we
reflected on those memos in order to draw out key impres-
sions needed for validity evidence. Later, counter evidence for our impressions was sought. Impressions with a lack of
strong counter evidence remained as emergent themes to
share as our qualitative findings.
All remaining forms of validity and reliability evidence
were analyzed quantitatively. Response process validity
was analyzed descriptively with raw scores. Relationship
to other variables validity was analyzed using independent
samples t-tests to compare demographic differences in
problem-solving ability (i.e., gender and ability level).
SPSS for Windows, Release Version 17.0 (SPSS, Inc.,
2008, Chicago, IL, USA; http://www.ibm.com/software/
analytics/spss) was used for these statistical analyses. Psy-
chometric properties related to all other forms of validity
and reliability evidence were analyzed using Rasch
methods for dichotomous responses (Rasch, 1960/1980) using Winsteps Version 3.74.0 (Linacre, 2012).
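To illustrate the relationship-to-other-variables analysis, the sketch below mirrors an independent-samples t-test as run in SPSS; because the study's raw data are not published, the simulated score vectors (sized to match the reported groups) are placeholders only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Placeholder person measures (logits); means and SDs are
# illustrative, not the study's raw data.
males = rng.normal(loc=1.28, scale=1.78, size=73)
females = rng.normal(loc=1.14, scale=1.73, size=64)

# Independent-samples t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(males, females)
df = len(males) + len(females) - 2
print(f"t({df}) = {t_stat:.3f}, p = {p_value:.3f}")
```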
Psychometric Results for PSM6
Validity Evidence From Test Content
Expert panel content review. The expert panel reviewed the items favorably in terms of CCSSM content alignment and item focus on problem solving.
They reported the following themes: (a) Items addressed
content and practices found in the sixth-grade CCSSM, (b)
items were complex enough that a solution was not imme-
diately recognized, (c) items were open enough to be
solved in at least two unique ways, and (d) items drew on realistic contexts. Furthermore, every item had a well-
defined solution set and the mathematics was correct and
accurate for the problems’ situations.
Readability from cognitive interviews and Flesch–
Kincaid analysis. Results from the cognitive interviews
indicated that below-average, average-, and above-average
performing students were able to read and solve problems
on the PSM6. Readers voiced ideas about problem solving
during the interview that matched their written work and
furthermore were found to be similar to work seen on
the larger data set. As a complement to the interviews, readability for the test was calculated using the Flesch–
Kincaid grade-level indicator (Kincaid et al., 1975). The
Flesch–Kincaid grade level was 4.5. This suggests that a
fifth-grade student could read the PSM6, surpassing
expectations for a sixth-grade assessment. Collectively,
raw score test statistics and readability measures (i.e., cog-
nitive interviews and Flesch–Kincaid rating) suggest that
response process validity evidence was high.
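For reference, the Flesch–Kincaid grade level reported above is computed from average sentence length and average syllables per word (Kincaid et al., 1975):

$$\text{Grade level} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$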
Test Bias
Expert panel identified test biases. In addition to the
student interviews, we asked the expert panel to share
whether they noticed any biases arising from test administration. Specifically, we asked them to consider cultural
(i.e., ethnicity and community-based), religious, and
gender-based biases found in the PSM6 that might influ-
ence respondents’ outcomes. The committee felt that there
were no evident biases that helped or inhibited a problem-
solver’s outcomes.
Student-identified test biases. Later during the cogni-
tive interviews, respondents were asked about any biases
they noticed in the tests (e.g., “Do you feel the test was
unfair to you or anyone who might take it?”). The respon-
dents felt the test was fair and addressed contexts they
understood. One student asked whether having prior
knowledge was the same as having an unfair advantage,
“I’ve been to Disney World where they have those shuttles
that take you from the parking lot to the front of it. If you have ridden one or seen one then I think this [item #4]
would be really easy but then again, maybe not.” These
findings provide evidence that the PSM6 addressed several
facets of test content validity.
Validity Evidence From Response Processes
The average student raw score was 5.7 (SD = 3.1; 38%
correct). The lowest and highest scores were 1.0 and 12.0,
respectively. At first glance, these scores might seem
disproportionately low for a 15-item measure. However,
this is reasonable because research on students’ problem
solving has indicated that students tend to have low
scores on problem-solving measures (e.g., Verschaffel et al., 1999). Problem-solving tasks are frequently more
difficult than tasks asking students to apply a specific
procedure (Lesh & Zawojewski, 2007; Mayer &
Wittrock, 2006).
Validity Evidence From Internal Structure
Unidimensionality. A fundamental quality of all mea-
surement is unidimensionality. A unidimensional mea-
surement assesses only one latent variable or trait. To
evaluate unidimensionality of the PSM6 using Rasch mea-
surement, a brief description of the statistical criteria used
and respective results are provided. Items with negative point-biserial correlations or infit/outfit mean square (MNSQ) fit statistics falling outside the range of .5–1.5 are not meaningful for measurement (Linacre, 2002) and should be removed from a test as they do not
contribute to a unidimensional latent trait. No PSM6 items had negative point-biserial correlations (.21–.66), and all but one item fell within Rasch MNSQ fit parameters for both infit (.79–1.18) and outfit (.74–1.33) statistics. One item had an appropriate point-biserial correlation (.49) and infit statistic (1.16) but a slightly higher than expected outfit statistic (1.59).
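To clarify what the infit and outfit statistics summarize, a minimal sketch of the standard mean-square formulas follows; Winsteps performed these computations in the study, so this NumPy version is our illustration only.

```python
import numpy as np

def rasch_fit(X, theta, delta):
    """Item infit/outfit mean-square (MNSQ) statistics.

    X:     (persons, items) matrix of 0/1 responses
    theta: (persons,) person measures in logits
    delta: (items,) item difficulties in logits
    """
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))  # model P(correct)
    W = P * (1.0 - P)                          # model variance of each response
    R2 = (X - P) ** 2                          # squared residuals
    outfit = (R2 / W).mean(axis=0)             # unweighted; outlier-sensitive
    infit = R2.sum(axis=0) / W.sum(axis=0)     # information-weighted
    return infit, outfit
```

Values near 1.0 indicate responses about as noisy as the model expects; the .5–1.5 range cited above reflects that convention.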
Conceptualization of the construct. When assessing internal structure validity, it is important to look at how the construct being studied relates to the predetermined theoretical structure. We hypothesized the construct of problem-solving ability, as measured by the PSM6, would have a theoretical hierarchical structure, with statistics and probability items being easiest and geometry items being most difficult. Number sense, expressions and equations, and ratio and proportion items were expected to vary in difficulty based on the nature of the content addressed as well as the manner in which it was addressed. This hypothesis stemmed from pilot testing results as well as impressions suggesting that some of the items' content was commonly found in elementary grade-level state mathematics standards during the era prior to the CCSSM. Figure 2 is a
variable map of the construct where people are on the left
and items are on the right.
[Figure 2 shows the person-item variable map: the logit scale runs from −4 at the bottom to 5 at the top, with persons plotted on the left (more able at the top) and items on the right (more difficult at the top). From most difficult to easiest, the items are G2, G7, G14, RP10, SP13, NS5, NS11, RP/EE12, NS3, EE9, EE6, RP/EE15, RP/EE4, SP1, and SP8.]
Figure 2. Variable map for PSM6. Each “#” represents 2 persons and each “-” represents 1 person. M denotes the mean; S indicates one standard deviation from the mean; T indicates two standard deviations from the mean. Items are abbreviated by the domain(s) addressed (e.g., geometry = G) and their location on the measure (e.g., second item = 2).
Easier items and students with less problem-solving
ability are lower on the map, whereas more difficult items
and students with greater ability are higher. An item with a difficulty level at the same measure as a student's ability level means that the student has a 50% chance of correctly answering the item. Items lower than a student's ability on the measure are easier for the student to answer, and items above a student's ability are more challenging for the student to answer correctly. Figure 2
graphically shows that the actual item difficulty hierarchy
for this measure aligns well with the hypothesized theo-
retical structure. However, some students were not able to answer any of the problem-solving items correctly, and the geometry items were more difficult than the ability estimates of all participants in this study. Overall, results
from the assessment of unidimensionality and the vari-
able map suggest that validity evidence for the internal
structure of the PSM6 is high and the test is meaningfully
measuring the theorized construct of problem-solving ability.
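To make the 50% interpretation above concrete under the dichotomous Rasch model given earlier: when a student's measure sits exactly one logit above an item's difficulty, the probability of success is

$$P = \frac{e^{1}}{1 + e^{1}} \approx .73,$$

and one logit below, P = e^{-1}/(1 + e^{-1}) ≈ .27.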
Evidence From Relationship to Other Variables
Two student-level demographic variables were used to
assess relationship to other variables validity evidence:
gender and math class grouping. It was hypothesized that
gender should not impact problem-solving ability mea-
sures, and thus PSM6 scores would not significantly differ
by gender (n = 73 males, n = 64 females). This hypothesis was confirmed. No statistically significant differences were found in problem-solving ability measures between males (M = 1.28, SD = 1.78) and females (M = 1.14, SD = 1.73); t(135) = .473, p = .637, two-tailed.
In terms of mathematics class grouping, students were placed in one of two mathematics groups (i.e.,
tracking). One group consisted of two mathematics
classes with learners who typically performed below the
average score when compared with peers within the
school on high-stakes end-of-course tests (remedial;
n = 42). The second group consisted of four mathematics
classes with learners who typically showed average or
above-average performance on high-stakes end-of-course
tests when compared with the average score for the
school (regular; n = 95). It was hypothesized that students in the regular mathematics classes would perform signifi-
cantly better than students in the remedial mathematics
classes in terms of math problem-solving ability mea-
sures. Thus, our hypothesis was that PSM6 scores would
be significantly higher for students in the regular math-
ematics classes compared with those in the remedial
classes. This hypothesis was also confirmed. Students in
the remedial mathematics classes ( M = 2.35, SD = 1.59)
performed statistically significantly lower on the PSM6
compared with students in the regular mathematics
classes ( M = .71, SD = 1.58); t (135) = 5.56, p < .001, one
tailed.
Evidence From Consequences of Testing
Student-identified feelings from taking the test. Students who participated in the cognitive (i.e., think-aloud) interviews following the test were asked to share how they felt
after taking the measure. For instance, we wanted to know
whether the experience brought about negative feelings
that outweighed the benefits of knowing one's problem-
solving ability. Did the test indicate any negative or posi-
tive biases toward respondents? Those who participated in
the think-aloud interviews shared that they did not expe-
rience feelings of negative affect beyond a typical testing
situation in the classroom. One below-average respondent
said, “It [the test] was hard. I felt like I could answer every
question but I’m not sure if they [answers] were right. I
want to know how I did.” During an interview, one above-average respondent shared, “The test was more difficult
than the tests we usually take in math. I like being chal-
lenged.” Drawing across these findings, there is a consen-
sus that the knowledge gained by taking the PSM6
outweighed potential negative factors from consequences
of testing. Thus, there was sufficient evidence for this
validity criterion.
Reliability Evidence
Rasch reliability is similar to traditional reliability because both estimate the statistical reproducibility of a set of values. However, Rasch computes reliabilities for item difficulties and person abilities, whereas CTT reliability is computed for raw scores. Rasch reliability of .70 is
considered acceptable, .80 is good, and .90 is excellent
(Duncan, Bode, Lai, & Perera, 2003). For the PSM6,
Rasch item reliability was excellent at .97, suggesting
strong internal consistency for items.
Rasch separation indicates the number of statistically
significant groups that can be classified along a variable.
Computing separation is essential because a measure
“is useful only if persons differ in the extent to which
they possess the trait measured” (Bode & Wright, 1999,
p. 295). Rasch separation of 1.50 is considered acceptable, 2.00 is good, and 3.00 is excellent (Duncan et al.,
2003). Item separation was excellent (6.02) and item
measures ranged from −3.08 to 4.20 logits, indicating a
meaningful variable (i.e., problem-solving ability) is
being measured. Collectively, these statistics suggest
all items work together to form a measure capable
of reliably assessing a wide range of problem-solving
abilities.
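As a consistency check (our arithmetic, using the standard Rasch relationship between separation G and reliability R):

$$R = \frac{G^2}{1 + G^2}, \qquad \frac{6.02^2}{1 + 6.02^2} \approx .97,$$

which agrees with the item reliability reported above.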
Discussion
Filling Literature Gaps
One of the major contributions of our study is that it
fills multiple gaps in the literature. As seen in Table 2, we
were unable to locate a validated and published problem-
solving measure that addressed mathematics content stan-
dards at either the state or national level. The PSM6addresses the sixth-grade SMCs, which at present time
have been adopted by 45 states in the United States. Thus,
there is an agreed-upon content framework underlying this
measure so that researchers have a means to explore sixth-
grade students’ problem solving that is content relevant to
most students across the United States. Further, the PSM6
engages students in mathematical behaviors and habits
such as persevering through content-appropriate problem-
solving tasks, attending to precision, and modeling with
mathematics. Because its items are open, complex, and realistic in nature, the PSM6 engages students with problems rather than simple exercises meant to promote efficiency with a known procedure (Kilpatrick
et al., 2001). With the validation of the PSM6, English-
speaking countries have an instrument to assess students’
problem solving much like Verschaffel et al. (1999) with
the Belgian WP tests. Hence, we are hopeful that this
measure provides a tool for future research into students’
mathematical problem solving.
PSM6 Assessment Rigor
The PSM6 includes quite challenging items; the
measure itself is difficult for the average sixth-grade
respondent. Problem solving requires different cognitive thinking than that typically needed for many classroom
mathematics tasks (Lesh & Zawojewski, 2007;
Schoenfeld, 2011; Verschaffel et al., 1999). Students, on
average, correctly answered approximately 6 items on the
15-item measure. While this seems unusually low, the
average score is similar to those found in past investiga-
tions of middle-school students’ problem solving (Charles
& Lester, 1984; Verschaffel et al., 1999).
Furthermore, this should not be a surprise when considering an item with respect to Bloom's revised taxonomy (Krathwohl, 2002). Krathwohl suggests six hierarchical cognitive processes: remembering, understanding, applying, analyzing, evaluating, and creating. A problem-solving
task found on the PSM6 might be minimally classified as
an analysis-level task because respondents are asked to
break material into parts, decide how the parts relate to
one another and the overall item structure, and finally act
on the parts (Krathwohl, 2002). On the other hand, tradi-
tional mathematics achievement tasks do not necessarily
have to meet this threshold. Consider the following task
from a sixth-grade mathematics achievement test: “The
Tasty Soup Company uses 200 square inches of material to
make each 500-milliliter soup can. What does the 200
square inches represent?” (Ohio Department of Education,
2013). This item, which is a common format on standardized mathematics achievement tests, asks respondents to interpret the meaning of written communication, which meets Krathwohl's description of an understanding-level
task. Thus, tasks found on the PSM6 will likely meet or
exceed the cognitive complexity found on most traditional
standardized mathematics achievement measures. Hence,
scores on the PSM6 should be lower than scores on tradi-
tional standardized mathematics achievement measures.
Instrument Validation Importance
Having confidence that results from mathematics mea-
sures are consistently measuring (reliability evidence)
what we expect them to measure (validity evidence) is
critical in mathematics education research. However, it is
not appropriate to attempt meaningful judgments based on test results if the measures are not shown to be
empirically sound. In our validation study, multiple ana-
lytic methods (i.e., qualitative, statistical, and Rasch mea-
surement) allowed us to demonstrate that the PSM6 has
met minimum criteria in many areas of validity and reli-
ability evidence. To further advance the field of mathemat-
ics ability assessment, more instruments need to be
developed and empirically evaluated through the use of
more modern analytical approaches. Our description of
Rasch modeling and results from its use might assist
others to develop high-quality instruments with sufficient evidence for validity and high internal consistency.
Significance and Implications
The primary purpose of this study was to explore the
psychometric properties of the PSM6. Results from this
validation study indicate that the PSM6 met the criteria for
measuring sixth-grade students’ problem-solving abilities
within the context of the CCSSM. This study provides a
tool to study sixth-grade students’ problem-solving ability
on items addressing content described in the mathematics
standards adopted by numerous states. Moreover, the
PSM6 engages students in problems that encourage them to exhibit the mathematical behaviors and habits described in the SMPs. The items are characterized as being open, complex, and realistic problems. Problems like these are
needed to stimulate problem-solving and critical thinking
skills during classroom instruction and assessment.
Solving problems on the PSM6 was challenging for most
students, but not outside of their cognitive grasp.
With this in mind, the PSM6 is not a high-stakes
measure and we do not encourage its use in that fashion. It
does, however, provide useful information about sixth-
grade students’ problem-solving ability while also
addressing CCSSM found across all sixth-grade domains.
Researchers as well as school personnel may feel confi-
dent using this measure to gather data about students’
problem-solving abilities, which are a key feature of the
CCSSM. Additionally, the PSM6 is not a measure of general sixth-grade mathematics achievement, and we do
not advocate for its use in that manner either. It has been
shown to reliably assess students’ problem-solving abili-
ties within a specific content framework and there is suf-
ficient validity evidence for its use in this context only.
Limitations and Future Directions
One limitation of any problem-solving measure intend-
ing to meet the aim of being relevant to content standards
is that only 45 of the 50 states across the United States
have adopted the CCSSM. Thus, the mathematics found in
this measure is not addressed by standards universally
adopted across the United States. A second limitation is that the items are somewhat more challenging than the
students’ abilities (as can be seen in the variable map in
Figure 2); however, this should be expected on a problem-
solving measure compared with one consisting of rote
exercises because problem solving typically is more chal-
lenging for students to master compared with completing
exercises (Kilpatrick et al., 2001). While this minor
student ability and item difficulty misalignment exists, the
PSM6 has exceeded the minimum criteria for providing
reliable and valid information about students’ problem-
solving abilities.
One future area of research to explore related to the
PSM6 pertains to what teachers can do in the CCSSM era
to support students’ problem solving. Initial research
using the PSM6 suggests that providing students with
problems that address CCSSM content and practices
supports students' problem-solving performance (Bostic
et al., in press). A second future direction of study is to
develop similar measures for other grade levels as well as
another version that includes items from other grade
levels. This latter version might provide useful data that teachers could use to consider whether a student is ready to move on to the next grade-level mathematics course or how best to enrich individual students' mathematics education.
References
American Educational Research Association (AERA), American Psychologi-
cal Association, & National Council on Measurement in Education.
(2014). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
Boaler, J., & Staples, M. (2008). Creating mathematical futures through an equitable teaching approach: The case of Railside School. Teachers College Record, 110, 608–645.
Bode, R., & Wright, B. (1999). Rasch measurement in higher education. In
J. Smart & W. Tierney (Eds.), Higher education handbook of theory and
research (Vol. XIV, pp. 287–316). Netherlands: Springer.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd edn). Mahwah, NJ: Erlbaum.
Bostic, J. (2015). A blizzard of a value. Mathematics Teaching in the Middle School, 20, 350–357.
Bostic, J., Pape, S., & Jacobbe, T. (2011). Validating two problem-solving
instruments for use with sixth-grade students. In L. Wiest & T. Lamberg
(Eds.), Proceedings of the 33rd annual meeting of the North American
chapter of the international group for the psychology of mathematics
education (pp. 756–763). Reno, NV: University of Nevada – Reno.
Retrieved from http://www.pmena.org/html/proceedings.html
Bostic, J., Pape, S., & Jacobbe, T. (in press). Encouraging sixth-grade
students’ problem-solving performance by teaching through problem
solving. Investigations in Mathematics Learning. Retrieved from http://
scholarworks.bgsu.edu/teach_learn_pub/31/
Charles, R., & Lester, F. K. (1984). An evaluation of a process-oriented instructional program in mathematical problem solving in grades 5 and 7. Journal for Research in Mathematics Education, 15, 15–34.
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test
theory (2nd edn). Mason, OH: Wadsworth Publishing.
De Ayala, R. (2009). The theory and practice of item response theory. New
York: Guilford Press.
Duncan, P., Bode, R., Lai, S., & Perera, S. (2003). Rasch analysis of a new
stroke-specific outcome scale: The stroke impact scale. Archives of Physi-
cal Medicine and Rehabilitation, 84, 950–963.
Embretson, S., & Reise, S. (2000). Item response theory for psychologists.
Mahwah, NJ: Erlbaum.
Gall, M., Gall, J., & Borg, W. (2007). Educational research: An introduction
(8th edn). Boston: Pearson.
Glaser, B., & Strauss, A. (1967/2012). The discovery of grounded theory:
Strategies for qualitative research. Mill Valley, CA: Sociology Press.
Hatch, A. (2002). Doing qualitative research in education settings. Albany,
NY: State University of New York Press.
Kanold, T., & Larson, M. (2012). Common core mathematics in a PLC at
work: Leader’s guide. Bloomington, IN: Solution Tree Press.
Kilpatrick, J., Swafford, J., & Findell, B. (2001). Adding it up: Helping
children learn mathematics. Washington, DC: National Academy
Press.
Kincaid, J., Fishburne, R., Rogers, R., & Chissom, B. (1975). Derivation of
new readability formulas (Automated Readability Index, Fog Count and
Flesch Reading Ease Formula) for Navy enlisted personnel. Research
Branch Report 8–75. Millington, TN: Naval Technical Training, U.S. Naval
Air Station, Memphis, TN.
Koestler, C., Felton, M. D., Bieda, K. N., & Otten, S. (2013). Connecting the
NCTM process standards and the CCSSM practices. Reston, VA: National
Council of Teachers of Mathematics.
Krathwohl, D. (2002). A revision of Bloom's taxonomy: An overview. Theory Into Practice, 41(4), 212–218.
Lesh, R., & Zawojewski, J. (2007). Problem solving and modeling. In F. Lester
Jr. (Ed.), Second handbook of research on mathematics teaching and learn-
ing (pp. 763–804). Charlotte, NC: Information Age Publishing.
Linacre, J. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85–106.
Linacre, J. (2012). Winsteps (Version 3.74) [Computer Software]. Beaverton,
OR: Winsteps.com.
Matney, G., Jackson, J., & Bostic, J. (2013). Connecting instruction, minute
contextual experiences, and a realistic assessment of proportional reason-
ing. Investigations in Mathematics Learning, 6, 41–68.
Mayer, R., & Wittrock, M. (2006). Problem solving. In P. Alexander &
P. Winne (Eds.), Handbook of educational psychology (pp. 287–303).
Mahwah, NJ: Erlbaum.
National Center for Education Statistics. (2009). The nation’s report card:
Mathematics 2009 (NCES 2010-451). Retrieved from http://nces.ed.gov/
nationsreportcard/pubs/main2009/2010451.asp
NCTM. (1989). Curriculum and evaluation standards for school mathemat-
ics. Reston, VA: Author.
NCTM. (2000). Principles and standards for school mathematics. Reston, VA: Author.
NCTM. (2009). Focus in high school mathematics: Reasoning and sense
making . Reston, VA: Author.
NGA, & CCSSO. (2010). Common core state standards for mathematics. Retrieved
from http://www.corestandards.org/assets/CCSSI_Math%20Standards.pdf
Ohio Department of Education. (2013). Grade 6 – Released tests
materials. Retrieved from http://education.ohio.gov/Topics/Testing/
Testing-Materials/Released-Test-Materials-for-Ohio-s-Grade-3-8-Achie/
Grade-6-Released-Tests-Materials
Organization for Economic Development. (2010). PISA 2009 results: What students know and can do—Student performance in reading, mathematics and science (Vol. I). Retrieved from http://www.keepeek.com/Digital
-Asset-Management/oecd/education/pisa-2009-results-what-students
-know-and-can-do_9789264091450-en#page1
Palm, T. (2006). Word problems as simulations of real-world situations: A proposed framework. For the Learning of Mathematics, 26, 42–47.
Polya, G. (1945/2004). How to solve it. Princeton, NJ: Princeton University
Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attain-
ment tests. Copenhagen: Danmarks Paedagogiske Institut.
Schoenfeld, A. (2011). How we think: A theory of goal-oriented decision
making and its educational applications. New York: Routledge.
Smith, E., Conrad, K., Chang, K., & Piazza, J. (2002). An introduction to
Rasch measurement for scale development and person assessment. Journal
of Nursing Measurement, 10, 189–206.
Smith, R. (1996). A comparison of methods for determining dimensionality in
Rasch measurement. Structural Equation Modeling, 3, 25–40.
Strauss, A., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory. London: Sage.
Verschaffel, L., De Corte, E., Lasure, S., Van Vaerenbergh, G., Bogaerts, H.,
& Ratinckx, E. (1999). Learning to solve mathematical application prob-
lems: A design experiment with fifth graders. Mathematical Thinking and
Learning, 1, 195–229.
Waugh, R., & Chapman, E. (2005). An analysis of dimensionality using factor
analysis (true-score theory) and Rasch measurement: What is the differ-
ence? Which method is better? Journal of Applied Measurement, 6, 80–99.
Wright, B. D. (1996). Comparing Rasch measurement and factor analysis.
Structural Equation Modeling, 3, 3–24.
Authors’ Notes
Keywords: learning processes, problem solving, stu-
dents and learning, student assessment.
Correspondence concerning this article should be
addressed to Jonathan David Bostic, School of Teaching
and Learning, Bowling Green State University, 529 Edu-
cation Building, Bowling Green, OH, USA. E-mail:
A Research to Practice article based on this paper
can be found alongside the electronic version at http://
wileyonlinelibrary.com/journal/ssm.