Measuring Sixth-Grade Students’ Problem Solving: Validating an
Instrument Addressing the Mathematics Common Core
Jonathan David Bostic Bowling Green State University
Toni A. Sondergeld Bowling Green State University
This article describes the development of a problem-solving instrument intended for classroom use that addresses the Common Core State Standards for Mathematics. In this study, 137 students completed the assessment, and their
responses were analyzed. Evidence for validity was collected and examined using the current standards for educational
and psychological testing. Instrument validation findings regarding internal consistency reliability were high, and
multiple forms of validity (i.e., content, response processes, internal structure, relationship to other variables, and
consequences of testing) were found to be appropriate. The resulting instrument provides teachers and researchers with
a sound tool to gather data about sixth-grade students’ problem solving in the Common Core era.
Problem solving has been a notable theme within mathematics education (National Council of Teachers of Mathematics [NCTM], 1989, 2000, 2009), and its importance is clearly seen in the Common Core State Standards for Mathematics (CCSSM; National Governors Association and Council of Chief State School Officers [NGA & CCSSO], 2010). A central feature of the CCSSM is a keen focus on mathematical problem solving, which is highlighted as its own Standard for Mathematical Practice (SMP) but also woven throughout several Standards for Mathematics Content (SMCs; Kanold & Larson, 2012). New standards mean that old measures of student learning need revision and revalidation, or new measures must be created and validated to ensure alignment of classroom curriculum and instruction with assessment. The purpose of this study was to pilot and validate a new measure of sixth-grade students' problem-solving abilities addressing CCSSM content and discuss its potential for future use.
Related Literature
Problems and Problem-Solving Framework
Problems are characterized as tasks that meet the
following criteria: (a) It is unknown whether a solution
exists, (b) a solution pathway is not readily determined,
and (c) there exists more than one way to answer the task
(Schoenfeld, 2011). Problems are distinct from exercises (Kilpatrick, Swafford, & Findell, 2001), and problem solving goes beyond the type of thinking needed to solve exercises (Mayer & Wittrock, 2006; Polya, 1945/2004).
Lesh and Zawojewski (2007) characterize problem solving
as involving “several iterative cycles of expressing, testing
and revising mathematical interpretations—and of sorting
out, integrating, modifying, revising, or refining clusters
of mathematical concepts from various topics within and
beyond mathematics” (p. 782). Many, including CCSSM
authors, have suggested that students ought to experience
developmentally appropriate tasks that are open, realistic,
and complex (Boaler & Staples, 2008; Bostic, Pape, &
Jacobbe, in press; Palm, 2006; Verschaffel et al., 1999). These sorts of tasks are often found in outside-of-school
contexts (Boaler & Staples, 2008; Bostic et al., in press)
and they provide opportunities for students to demonstrate
critical thinking (Bostic, 2015; Lesh & Zawojewski, 2007;
Matney, Jackson, & Bostic, 2013). “Open” tasks can be
solved in different ways and offer learners multiple entry
points while problem solving. “Realistic” tasks draw upon
a problem solver’s experiential knowledge and engage the
student in a task that might occur in the real world.
“Complex” tasks require an individual to persevere and employ sustained reasoning to solve them. Such open, realistic, and complex tasks offer opportunities for students to
exhibit mathematical behaviors and habits described in the
SMPs that connote problem solving (NGA & CCSSO,
2010; see Table 1). The SMPs are connected to similar
mathematics behaviors and habits described in character-
izations of mathematical proficiency (Kilpatrick et al.,
2001) as well as the NCTM’s process standards (NCTM,
2000). Thus the SMPs are not necessarily new ideas;
instead, they are clearly “linked to mathematical goals
articulated in previous documents and by other groups”
(Koestler, Felton, Bieda, & Otten, 2013, p. v).
With these mathematical behaviors and habits in mind, it is necessary to create measures that assess students' math-
ematics content knowledge through open, complex, and
realistic tasks addressing the CCSSM content and practice
standards.
Measures of Problem-Solving Ability
A review of the literature through multiple scholarly
search engines (e.g., EBSCO, Google Scholar, and
Science Direct) demonstrated that content standards at the
state and national levels have not been assessed in current
problem-solving measures. Table 2 provides a list of
merely four measures found in the literature that address
mathematical problem solving (not mathematical achieve-
ment). All were discussed in peer-reviewed journals, and
evidence for validity of the measures was shared. Mea-
sures published in journals and books without peer review and/or measures without evidence for validity were not
considered in our review.
Previous problem-solving measures for middle-school
students can be described as using one of two types of
problem-solving measures. The first set includes analysis
of large-scale data sets such as the Programme for Inter-
national Student Assessment and National Assessment
of Educational Progress (Organization for Economic
Development, 2010; National Center for Education
Statistics, 2009). The second set of studies draws on
locally constructed measures (e.g., Charles & Lester,
1984; Verschaffel et al., 1999). Taken collectively, these studies lay a foundation for examining middle-grade stu-
dents’ problem-solving ability. They also suggest a need
for measures that support assessment of students’ math-
ematical problem solving such that the mathematics
covered in the items addresses the mathematics content
that students are expected to learn in their mathematics
classes (i.e., standards-aligned assessments). Our study
aims to validate a new measure of problem solving that
will work toward meeting this need.
Belgian word-problem tests. The initial design of the
test described in this manuscript stems from two previous
problem-solving word-problem (WP) tests constructed to be parallel in nature (different items with the same
content) for use with Belgian fifth-grade students
(Verschaffel et al., 1999). The goal of their investigation
was to explore the impact of supplementing typical math-
ematics instruction with problem-solving instruction,
specifically researching students’ problem-solving perfor-
mance. They created two parallel measures (WP pretest
and WP posttest) composed of 10 open, realistic, and
complex WPs. An item from the WP pretest states, “Martha
is reading a book. Suddenly she finds out that some pages
are missing because page 135 is immediately followed by
page 173. How many pages are missing?” (Verschaffel et al., 1999, p. 214). Each item on one test had a similar
but not identical task on the other test, which was assumed
to be parallel in content and difficulty. Verschaffel et al.'s research provided two problem-solving instruments (WP pretest and posttest) that most closely met the intent of our work with similarly aged students. Hence their work grounded the construction of our measure.
In a validation study, a total of 232 Belgian fifth-grade
students completed Verschaffel et al.'s (1999) problem-solving measures. Internal consistency results suggested moderate levels of reliability for their measures, Cronbach's α = .56 (pretest) and .75 (posttest). Items and
measures were deemed mathematically correct and devel-
opmentally appropriate by an expert panel consisting of
mathematicians and mathematics educators. Furthermore,
the panel agreed that items were open, complex, and real-
istic to students completing the measures. Results indicated that students averaged 1.6 correct responses on the 10-item measures. This
Table 1
Standards for Mathematical Practice

SMP # | Title
1 | Make sense of problems and persevere in solving them.
2 | Reason abstractly and quantitatively.
3 | Construct viable arguments and critique the reasoning of others.
4 | Model with mathematics.
5 | Use appropriate tools strategically.
6 | Attend to precision.
7 | Look for and make use of structure.
8 | Look for and express regularity in repeated reasoning.
Table 2
Characteristics of Measures Developed to Assess Middle-School Students' Problem Solving

Measure name | Author (Year) | Format | Age/grade level | Aligned with state or national standards
Programme for International Student Assessment (PISA) | Organization for Economic Development (2010) | Multiple choice and constructed response | 15–16 years old | None indicated
National Assessment of Educational Progress (NAEP) | National Center for Education Statistics (2009) | Multiple choice and constructed response | 8th grade | None indicated
No name | Verschaffel et al. (1999) | Constructed response | 5th grade | None indicated
No name | Charles and Lester (1984) | Constructed response | 5th and 7th grades | None indicated
result supports the claim that problem solving is in fact difficult for students to master.
These measures were used to create an initial problem-
solving instrument for use in the United States (Bostic,
Pape, & Jacobbe, 2011). To do this, a multistep process was
completed. First, an individual who previously taught
Dutch at the university level translated the instruments into English. Second, items consisting of only one sentence
were not used in the English version because they were
significantly shorter than most tasks. For instance, the
readability score using Flesch–Kincaid analysis (Kincaid,
Fishburne, Rogers, & Chissom, 1975) was much lower on
items consisting of one sentence compared with others.
Third, problems were revised to update contexts, to reflect
U.S. students’ experiences, and to clarify the language.
Finally, three rounds of pilot testing, collection of evidence
for validity, analysis of psychometric properties, and revi-
sion supported creating the measure that is the focus of this
paper.
Validity and Reliability of Tests
To be considered a sound measure, tests should provide
multiple pieces of evidence for validity as well as reliability
(American Educational Research Association [AERA],
American Psychological Association, & National Council
on Measurement in Education, 2014; Gall, Gall, & Borg,
2007). Sufficient validity evidence is needed to determine
the degree to which interpretations of test scores are sup-
ported by use of the tests (AERA et al., 2014; Gall et al.,
2007). Greater validity evidence leads to stronger confi-
dence in the interpretations of score reports. There are numerous types of validity evidence discussed in research
literature; the “five main types of evidence for demonstrat-
ing the validity of test-score interpretations [are] evidence
from: test content, response processes, internal structure,
relationship to other variables, and consequences of
testing” (Gall et al., 2007, p. 195). Test content validity
evidence indicates the degree to which content (or items)
found on the measure addresses the construct of interest.
This is a judgment call typically determined by a panel of
content experts. Response processes validity evidence suggests the degree to which the processes engaged in by the respondent when completing the test are consistent with a known construct (Gall et al., 2007). One technique for
gathering response process evidence is to conduct cognitive
interviews or think-aloud interviews with participants rep-
resentative of those who would be expected to complete the
assessment. Internal structure evidence describes how one
item relates to others. Evaluating how the items work together as a construct through traditional methods (i.e., factor analysis) or modern measurement (i.e., Rasch analysis) is a typical means of gathering evidence in this area.
Validity evidence for relationship to other variables may
take on various forms. Test makers might seek convergent
or divergent evidence that examines how test scores corre-
late with scores on other measures (i.e., similarly and
differently, respectively). Another approach is to examine
how score distributions compare for two groups hypothesized to be similar or different. Finally, consequences of testing validity describes how the values implied by a test and the consequences of taking it impact respondents.
Interviews with respondents following measure adminis-
tration may add evidence indicating to what degree test
taking influenced respondents affectively and cognitively.
Validity evidence is necessary but not sufficient for a
high-quality test (AERA et al., 2014); thus test construc-
tion must also include an examination of internal consis-
tency (i.e., reliability). Internal consistency reliability
estimates the “coefficient of precision from a set of
real test scores” (Crocker & Algina, 2006, p. 117). There are multiple internal consistency measures such as Cronbach's α, Rasch reliability, and Raykov's approach.
There is no best method for assessing internal consistency
because each has strengths and limitations. Historically,
classical testing theory approaches were used to examine
psychometric properties of measures (Crocker & Algina,
2006). However, in the last 30 years, an alternate psycho-
metric analysis framework arose called item response
theory (IRT). IRT has become a popular approach to inves-
tigate psychometric properties of tests because of its
advantages over classical testing theory in multiple ways. We discuss one type of IRT analysis, Rasch modeling
(Rasch, 1960/1980), and some differences between it and
classical testing theory in the next section as a way of
examining item difficulty and item discrimination as well
as the properties of the test as a whole.
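As a concrete aside before turning to Rasch modeling, here is a minimal sketch of Cronbach's α, one of the internal consistency statistics named above. The study itself used SPSS and Winsteps, so this NumPy-based function and its toy data are our illustration only, not code from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    For dichotomous (0/1) items, as on the PSM6, alpha
    reduces to KR-20.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 4 persons x 3 dichotomous items (illustrative only)
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
print(round(cronbach_alpha(responses), 2))  # 0.75
```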
Item Response Theory: Rasch Modeling
Rasch modeling, often referred to as one-parameter IRT,
has four key statistical assumptions: (a) Ability is a unidi-
mensional trait, (b) items are locally independent, (c) the
probability of correctly answering items increases as ability
increases, and (d) item parameters are independent of
respondents' abilities (De Ayala, 2009; Embretson & Reise, 2000).
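In symbols, the standard dichotomous Rasch model (Rasch, 1960/1980) gives the probability that person n answers item i correctly as a function of the difference between person ability θ_n and item difficulty δ_i:

$$P(X_{ni} = 1) = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}$$

When θ_n = δ_i the probability is .50, an interpretation that the variable map presented later relies on.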
Rasch modeling has multiple benefits over classical testing theory, or CTT (De Ayala, 2009; Embretson & Reise, 2000). First, Rasch methods lead to
results that offer trait-level estimates of an individual’s
ability that depend on an individual’s responses and item
properties (Embretson & Reise, 2000). That is, Rasch mod-
eling allows individuals to be measured against the con-
struct (i.e., criteria or items) rather than a norm-referenced
sample, which elicits a criterion-referenced interpretation
of results. Thus, multiple populations can be compared with one another, which cannot necessarily be done with CTT because CTT results are sample dependent.
A second benefit of using Rasch methods is that mea-
sures using a Rasch framework are likely to offer more
accurate estimates of problem-solvers' abilities (De Ayala, 2009; Embretson & Reise, 2000). For example, let us
suppose that two students, Student A and Student B, both
correctly answer 6 items on a 10-item assessment. Student
A correctly answers the six easiest items, whereas Student B
correctly answers the six most difficult items. With CTT,
both students earn a 60% regardless of item difficulty. If
using Rasch methods, due to the conjoint measurement
model placing items and people on the same ruler, then
Student B earns a higher score than Student A because
item difficulty is taken into consideration.
Third, Rasch modeling approaches require fewer total
items to accurately measure someone’s ability compared with CTT approaches (De Ayala, 2009), and missing data
are not problematic when using Rasch modeling because
of the probabilistic nature of the conjoint item/person
measure (Bond & Fox, 2007). In practice, this means that
a test taker can fail to complete the test or skip items they
are unsure of, and Rasch methods still have the capacity to estimate an accurate ability measure for that person based on the data that were collected, by relating the correct item responses to their item difficulty.
Finally, each test completer has his/her own standard
error estimate rather than assuming one for all respondents (Embretson & Reise, 2000), which elicits more precise
ability measures. In summary, using Rasch methods for
measure creation and refinement is considered one of
the best approaches by many social science researchers
because of its ability to convert ordinal data into conjoint,
hierarchical, equal-interval measures that place both
person abilities and item difficulties on the same scale so
they can be directly compared with each other (see Bond
& Fox, 2007).
With so many advantages of Rasch modeling over CTT noted in terms of assessment construction and validation (e.g., Smith, 1996; Waugh & Chapman, 2005; Wright, 1996), one might wonder why modern measurement techniques still fail to dominate the field of test construction in comparison with CTT.
Two main reasons have been cited for this discrepancy
(Smith, Conrad, Chang, & Piazza, 2002). First, CTT
assumptions are considered “weak” and easily met by test
developers, whereas Rasch specifications are stricter and
at times render data unusable if they do not fit the model.
Second, although traditional CTT statistics are taught in
introductory courses to most graduate students, Rasch (or
IRT) measurement methods are only taught in more
advanced courses and to far fewer individuals. In fact,
many universities do not offer such courses to their gradu-
ate students at all. With the advantages of Rasch modeling
in mind, along with the need to revise or develop new CCSSM-aligned assessments, it is an ideal time to employ
more advanced measurement methods to advance research
on middle-grades students’ problem solving.
Objectives of This Study
In this study, we examine the psychometric qualities
(i.e., validity and reliability evidence discussed earlier) of
a new measure of student mathematics problem-solving
ability focusing on grade six SMCs. The measure is called
the Grade 6 Problem-Solving Measure (PSM6). Our
overarching research question is: What are the psychomet-
ric properties for the PSM6? We share evidence for the five
main types of validity (test content, response processes, internal structure, relationship to other variables, and con-
sequences of testing), as well as test bias, construct valid-
ity, internal consistency reliability, unidimensionality, item
difficulty to student ability targeting, and item function.
Method
Data Sources, Collection, and Procedures
Grade 6 problem-solving measure. The PSM6 has
evolved over the course of previous investigations. One
previous form and its related psychometric properties (i.e.,
internal consistency and item characteristics) are discussed in Bostic et al. (2011). Several revisions have taken place
since that publication. First, more items drawing on the same latent trait and content area (e.g., geometry; statistics and probability; expressions and equations) were added to produce better estimates of sixth-grade students' problem-solving abilities. Previous versions omitted the geometry domain, whereas the current version contains three geometry items, aligning it better with the CCSSM. A second issue was creating easier and/or more difficult items for each domain. For instance, previous versions had statistics and probability items that were correctly solved by a large percentage of respondents and considered easier items. There were no difficult statistics and probability items to balance the easier ones; hence, revisions were needed to rectify this issue. An item from the PSM6 is shown in Figure 1.
When distributed to students, one item was displayed on
each page and figures were large in size for ease in read-
ability. Item descriptions and their associated SMCs are
presented in Table 3.
Grade 6 students from a suburban school district in a Midwest state were contacted during the last nine-week
period of the academic year about their willingness to
complete the PSM6. A total of 137 students volunteered
to complete it during one administration. None of the
respondents was an English-language learner or receiving
services for a disability or giftedness. Test administration
took approximately 75 minutes.
Qualitative data. Various sources of qualitative data
were collected to inform the findings of this study. The first
set of data came from a content expert panel consisting of
one mathematician holding a Ph.D., two university-level
mathematics educators holding terminal degrees, and two sixth-grade teachers. Panel members shared that they were familiar with the SMCs and SMPs. They were asked to
examine the PSM6 and consider the following questions
for each item.
• Does the item address one or more sixth-grade
SMCs? If so, which one(s)?
• Does the item provide a context to engage students in
the SMPs?
• Is there more than one developmentally appropriate (i.e., grade level or lower) way to solve the problem?
• Is the item complex enough to be considered a
problem?
• Does the item draw on realistic contexts that students
might recognize?
• Do you perceive bias in the item toward any group of
individuals?
The mathematician also responded to two additional
prompts.
• Does the item have a well-defined solution set?
• Is the mathematics in the item accurate and presented
clearly?
A second set of qualitative data stemmed from cognitive
interviews with 10 students. The purpose of these inter-
views was twofold. One purpose was to assess whether
students could read and comprehend the words in the
items so they might understand the situations embedded
within them. The second purpose was to assess students’
feelings about solving these problems and more generally,
mathematical problem solving. There were five boys and
A group of 150 tourists were waiting for a shuttle to take them from a parking lot to a theme
park's entrance. The only way they could reach the park's entrance was by taking this shuttle. The shuttle can carry 18 tourists at a time. After one hour, everyone in the group of 150 tourists
reached the theme park’s entrance. What is the fewest number of times that the shuttle picked
tourists up from the parking lot?
Figure 1. Sample PSM6 item. Item description is “Tourist shuttle.” Connections to SMCs are found in Table 3.
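For readers who want the arithmetic, one solution path (our illustration; the published item does not include an answer key) treats the fewest pickups as the smallest whole number of 18-tourist trips covering all 150 tourists:

$$\left\lceil \frac{150}{18} \right\rceil = \lceil 8.3\overline{3} \rceil = 9 \text{ pickups.}$$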
Table 3
Connections to Standards for Mathematics Content

Question # | Description | Revised or added | Primary SMC* | Secondary SMC
1 | Ice cream | No revision | 6.SP.1 | –
2 | Lightning bolt | Added | 6.G.1 | –
3 | Wooden gate | No revision | 6.NS.3 | 6.NS.1
4 | Tourist shuttle | No revision | 6.RP.3 | 6.EE.7
5 | Water park | No revision | 6.NS.3 | –
6 | Silly bandz | No revision | 6.EE.7 | –
7 | Sam's box | Added | 6.G.2 | –
8 | Sandhill lunch | No revision | 6.SP.1 | –
9 | Bicycle | No revision | 6.EE.2 | –
10 | Jerome's paint | Revised | 6.RP.3 | –
11 | Youth group | No revision | 6.NS.3 | –
12 | Animal day care | No revision | 6.RP.3 | 6.EE.7
13 | Julie's fish | Added | 6.SP.5 | –
14 | Pyramid | Added | 6.G.4 | 6.G.1
15 | Glass bottom boat | No revision | 6.RP.3 | 6.EE.7

Note. *Descriptions of each CCSSM domain abbreviation are EE = expressions and equations; G = geometry; NS = number sense; RP = ratio and proportion; and SP = statistics and probability.
five girls and each one characterized himself or herself as
African American, Caucasian (not Hispanic or Latino/a),
Hispanic, or Caucasian (Hispanic or Latino/a). Mathemat-
ics teachers assisted with sample selection diversity by
suggesting students who typically performed below
average, on average, and above average in comparison
with their peers. Thus, at least one boy and one girl were representative of each ability level.
Students were presented with an introductory task to
prime them for thinking aloud. Next, they were given one
task at a time. An interviewer asked students to solve
problems and voice their thinking during problem solving.
After solving the problem, the interviewer asked the
student whether he/she might show another way to solve
the same problem. When the student could not share a
different way to solve the problem, the interviewer pro-
ceeded to the next item. The interviewer did not answer
students’ questions but instead encouraged them to reread
the task and think about what it asked. At the end of the interview, students were asked to describe how they felt
about the items and the process of solving these problems.
Data Analysis
Quantitative and qualitative data analyses were both
employed in this study depending on the data type. Quali-
tatively, inductive analysis (Hatch, 2002) was conducted to
assess content validity evidence (expert review), test bias
(expert review and student cognitive interviews), and con-
sequences of test-taking validity evidence (student cogni-
tive interviews). Interview data were transcribed and
salient themes that emerged across domains were identified using inductive analysis as well. In sum, inductive analysis
is the process of identifying salient themes from data sets
(Glaser & Strauss, 1967/2012; Hatch, 2002; Strauss &
Corbin, 1998). Our implementation of Hatch’s (2002)
approach is shared to understand how we arrived at our
findings. First, we familiarized ourselves with the qualita-
tive data by rereading all materials (e.g., expert panel
reviews, transcribed cognitive interviews, etc.). Next,
initial ideas stemming from this rereading were recorded
as memos to consider during later analyses. Then, we
reflected on those memos in order to draw out key impres-
sions needed for validity evidence. Later, counter evidence for our impressions was sought. Impressions with a lack of
strong counter evidence remained as emergent themes to
share as our qualitative findings.
All remaining forms of validity and reliability evidence
were analyzed quantitatively. Response process validity
was analyzed descriptively with raw scores. Relationship
to other variables validity was analyzed using independent
samples t-tests to compare demographic differences in
problem-solving ability (i.e., gender and ability level).
SPSS for Windows, Release Version 17.0 (SPSS, Inc.,
2008, Chicago, IL, USA; http://www.ibm.com/software/
analytics/spss) was used for these statistical analyses. Psy-
chometric properties related to all other forms of validity
and reliability evidence were analyzed using Rasch
methods for dichotomous responses (Rasch, 1960/1980) using Winsteps Version 3.74.0 (Linacre, 2012).
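To illustrate the relationship-to-other-variables analysis, the sketch below mirrors an independent-samples t-test as run in SPSS; because the study's raw data are not published, the simulated score vectors (sized to match the reported groups) are placeholders only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Placeholder person measures (logits); means and SDs are
# illustrative, not the study's raw data.
males = rng.normal(loc=1.28, scale=1.78, size=73)
females = rng.normal(loc=1.14, scale=1.73, size=64)

# Independent-samples t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(males, females)
df = len(males) + len(females) - 2
print(f"t({df}) = {t_stat:.3f}, p = {p_value:.3f}")
```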
Psychometric Results for PSM6
Validity Evidence From Test Content
Expert panel content review. The expert panel reviewed the items favorably in terms of CCSSM content alignment and item focus on problem solving.
They reported the following themes: (a) Items addressed
content and practices found in the sixth-grade CCSSM, (b)
items were complex enough that a solution was not imme-
diately recognized, (c) items were open enough to be
solved in at least two unique ways, and (d) items drew on realistic contexts. Furthermore, every item had a well-
defined solution set and the mathematics was correct and
accurate for the problems’ situations.
Readability from cognitive interviews and Flesch–
Kincaid analysis. Results from the cognitive interviews
indicated that below-average, average-, and above-average
performing students were able to read and solve problems
on the PSM6. Readers voiced ideas about problem solving
during the interview that matched their written work and
furthermore were found to be similar to work seen on
the larger data set. As a complement to the interviews, readability for the test was calculated using the Flesch–
Kincaid grade-level indicator (Kincaid et al., 1975). The
Flesch–Kincaid grade level was 4.5. This suggests that a
fifth-grade student could read the PSM6, surpassing
expectations for a sixth-grade assessment. Collectively,
raw score test statistics and readability measures (i.e., cog-
nitive interviews and Flesch–Kincaid rating) suggest that
response process validity evidence was high.
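For reference, the Flesch–Kincaid grade level reported above is computed from average sentence length and average syllables per word (Kincaid et al., 1975):

$$\text{Grade level} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$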
Test Bias
Expert panel identified test biases. In addition to the
student interviews, we asked the expert panel to share
whether they noticed any biases arising from test administration. Specifically, we asked them to consider cultural
(i.e., ethnicity and community-based), religious, and
gender-based biases found in the PSM6 that might influ-
ence respondents’ outcomes. The committee felt that there
were no evident biases that helped or inhibited a problem-
solver’s outcomes.
Student-identified test biases. Later during the cogni-
tive interviews, respondents were asked about any biases
they noticed in the tests (e.g., “Do you feel the test was
unfair to you or anyone who might take it?”). The respon-
dents felt the test was fair and addressed contexts they
understood. One student asked whether having prior
knowledge was the same as having an unfair advantage,
“I’ve been to Disney World where they have those shuttles
that take you from the parking lot to the front of it. If you have ridden one or seen one then I think this [item #4]
would be really easy but then again, maybe not.” These
findings provide evidence that the PSM6 addressed several
facets of test content validity.
Validity Evidence From Response Processes
The average student raw score was 5.7 (SD = 3.1; 38%
correct). The lowest and highest scores were 1.0 and 12.0,
respectively. At first glance, these scores might seem
disproportionately low for a 15-item measure. However,
this is reasonable because research on students’ problem
solving has indicated that students tend to have low
scores on problem-solving measures (e.g., Verschaffel et al., 1999). Problem-solving tasks are frequently more
difficult than tasks asking students to apply a specific
procedure (Lesh & Zawojewski, 2007; Mayer &
Wittrock, 2006).
Validity Evidence From Internal Structure
Unidimensionality. A fundamental quality of all mea-
surement is unidimensionality. A unidimensional mea-
surement assesses only one latent variable or trait. To
evaluate unidimensionality of the PSM6 using Rasch mea-
surement, a brief description of the statistical criteria used
and respective results are provided. Items with negative point-biserial correlations or infit/outfit mean square (MNSQ) fit statistics falling outside the range of .5–1.5 are not meaningful for measurement (Linacre, 2002) and should be removed from a test as they do not
contribute to a unidimensional latent trait. No PSM6 items had negative point-biserial correlations (.21–.66), and all but one item fell within Rasch MNSQ fit parameters for both infit (.79–1.18) and outfit (.74–1.33) statistics. One item had an appropriate point-biserial correlation (.49) and infit statistic (1.16) but a slightly higher than expected outfit statistic (1.59).
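To clarify what the infit and outfit statistics summarize, a minimal sketch of the standard mean-square formulas follows; Winsteps performed these computations in the study, so this NumPy version is our illustration only.

```python
import numpy as np

def rasch_fit(X, theta, delta):
    """Item infit/outfit mean-square (MNSQ) statistics.

    X:     (persons, items) matrix of 0/1 responses
    theta: (persons,) person measures in logits
    delta: (items,) item difficulties in logits
    """
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))  # model P(correct)
    W = P * (1.0 - P)                          # model variance of each response
    R2 = (X - P) ** 2                          # squared residuals
    outfit = (R2 / W).mean(axis=0)             # unweighted; outlier-sensitive
    infit = R2.sum(axis=0) / W.sum(axis=0)     # information-weighted
    return infit, outfit
```

Values near 1.0 indicate responses about as noisy as the model expects; the .5–1.5 range cited above reflects that convention.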
Conceptualization of the construct. When assessing internal structure validity, it is important to look at how the construct being studied relates to the predetermined theoretical structure. We hypothesized the construct of problem-solving ability, as measured by the PSM6, would have a theoretical hierarchical structure, with statistics and probability items being easiest and geometry items being most difficult. Number sense, expressions and equations, and ratio and proportion items were expected to vary in difficulty based on the nature of the content addressed as well as the manner in which it was addressed. This hypothesis stemmed from pilot testing results as well as impressions suggesting that some of the items' content was commonly found in elementary grade-level state mathematics standards during the era prior to the CCSSM. Figure 2 is a
variable map of the construct where people are on the left
and items are on the right.
[Figure 2 shows the person-item variable map: the logit scale runs from −4 at the bottom to 5 at the top, with persons plotted on the left (more able at the top) and items on the right (more difficult at the top). From most difficult to easiest, the items are G2, G7, G14, RP10, SP13, NS5, NS11, RP/EE12, NS3, EE9, EE6, RP/EE15, RP/EE4, SP1, and SP8.]
Figure 2. Variable map for PSM6. Each “#” represents 2 persons and each “-” represents 1 person. M denotes the mean; S indicates one standard deviation from the mean; T indicates two standard deviations from the mean. Items are abbreviated by the domain(s) addressed (e.g., geometry = G) and their location on the measure (e.g., second item = 2).
Easier items and students with less problem-solving
ability are lower on the map, whereas more difficult items
and students with greater ability are higher. An item with a difficulty level at the same measure as a student's ability level means that the student has a 50% chance of correctly answering the item. Items lower than a student's ability on the measure are easier for the student to answer, and items above a student's ability are more challenging for the student to answer correctly. Figure 2
graphically shows that the actual item difficulty hierarchy
for this measure aligns well with the hypothesized theo-
retical structure. However, some students were not able to answer any of the problem-solving items correctly, and the geometry items were more difficult than the ability estimates of all participants in this study. Overall, results
from the assessment of unidimensionality and the vari-
able map suggest that validity evidence for the internal
structure of the PSM6 is high and the test is meaningfully
measuring the theorized construct of problem-solving ability.
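To make the 50% interpretation above concrete under the dichotomous Rasch model given earlier: when a student's measure sits exactly one logit above an item's difficulty, the probability of success is

$$P = \frac{e^{1}}{1 + e^{1}} \approx .73,$$

and one logit below, P = e^{-1}/(1 + e^{-1}) ≈ .27.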
Evidence From Relationship to Other Variables
Two student-level demographic variables were used to
assess relationship to other variables validity evidence:
gender and math class grouping. It was hypothesized that
gender should not impact problem-solving ability mea-
sures, and thus PSM6 scores would not significantly differ
by gender (n = 73 males, n = 64 females). This hypothesis was confirmed. No statistically significant differences were found in problem-solving ability measures between males (M = 1.28, SD = 1.78) and females (M = 1.14, SD = 1.73); t(135) = .473, p = .637, two-tailed.
In terms of mathematics class grouping, students were placed in one of two mathematics groups (i.e.,
tracking). One group consisted of two mathematics
classes with learners who typically performed below the
average score when compared with peers within the
school on high-stakes end-of-course tests (remedial;
n = 42). The second group consisted of four mathematics
classes with learners who typically showed average or
above-average performance on high-stakes end-of-course
tests when compared with the average score for the
school (regular; n = 95). It was hypothesized that students in the regular mathematics classes would perform signifi-
cantly better than students in the remedial mathematics
classes in terms of math problem-solving ability mea-
sures. Thus, our hypothesis was that PSM6 scores would
be significantly higher for students in the regular math-
ematics classes compared with those in the remedial
classes. This hypothesis was also confirmed. Students in
the remedial mathematics classes ( M = 2.35, SD = 1.59)
performed statistically significantly lower on the PSM6
compared with students in the regular mathematics
classes ( M = .71, SD = 1.58); t (135) = 5.56, p < .001, one
tailed.
Evidence From Consequences of Testing
Student-identified feelings from taking the test. Students who participated in the cognitive (i.e., think-aloud) interviews following the test were asked to share how they felt
after taking the measure. For instance, we wanted to know
whether the experience brought about negative feelings
that outweighed the benefits of knowing one's problem-
solving ability. Did the test indicate any negative or posi-
tive biases toward respondents? Those who participated in
the think-aloud interviews shared that they did not expe-
rience feelings of negative affect beyond a typical testing
situation in the classroom. One below-average respondent
said, “It [the test] was hard. I felt like I could answer every
question but I’m not sure if they [answers] were right. I
want to know how I did.” During an interview, one above-average respondent shared, “The test was more difficult
than the tests we usually take in math. I like being chal-
lenged.” Drawing across these findings, there is a consen-
sus that the knowledge gained by taking the PSM6
outweighed potential negative factors from consequences
of testing. Thus, there was sufficient evidence for this
validity criterion.
Reliability Evidence
Rasch reliability is similar to traditional reliability because both estimate the statistical reproducibility of a set of values. However, Rasch computes reliabilities for item difficulties and person abilities, whereas CTT reliability is computed for raw scores. Rasch reliability of .70 is
considered acceptable, .80 is good, and .90 is excellent
(Duncan, Bode, Lai, & Perera, 2003). For the PSM6,
Rasch item reliability was excellent at .97, suggesting
strong internal consistency for items.
Rasch separation indicates the number of statistically
significant groups that can be classified along a variable.
Computing separation is essential because a measure
“is useful only if persons differ in the extent to which
they possess the trait measured” (Bode & Wright, 1999,
p. 295). Rasch separation of 1.50 is considered acceptable, 2.00 is good, and 3.00 is excellent (Duncan et al.,
2003). Item separation was excellent (6.02) and item
measures ranged from −3.08 to 4.20 logits, indicating a
meaningful variable (i.e., problem-solving ability) is
being measured. Collectively, these statistics suggest
all items work together to form a measure capable
of reliably assessing a wide range of problem-solving
abilities.
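As a consistency check (our arithmetic, using the standard Rasch relationship between separation G and reliability R):

$$R = \frac{G^2}{1 + G^2}, \qquad \frac{6.02^2}{1 + 6.02^2} \approx .97,$$

which agrees with the item reliability reported above.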
Discussion
Filling Literature Gaps
One of the major contributions of our study is that it
fills multiple gaps in the literature. As seen in Table 2, we
were unable to locate a validated and published problem-
solving measure that addressed mathematics content stan-
dards at either the state or national level. The PSM6addresses the sixth-grade SMCs, which at present time
have been adopted by 45 states in the United States. Thus,
there is an agreed-upon content framework underlying this
measure so that researchers have a means to explore sixth-
grade students’ problem solving that is content relevant to
most students across the United States. Further, the PSM6
engages students in mathematical behaviors and habits
such as persevering through content-appropriate problem-
solving tasks, attending to precision, and modeling with
mathematics. Because its items are open, complex, and realistic in nature, the PSM6 engages students with problems rather than simple exercises meant to promote efficiency with a known procedure (Kilpatrick
et al., 2001). With the validation of the PSM6, English-
speaking countries have an instrument to assess students’
problem solving much like Verschaffel et al. (1999) with
the Belgian WP tests. Hence, we are hopeful that this
measure provides a tool for future research into students’
mathematical problem solving.
PSM6 Assessment Rigor
The PSM6 includes quite challenging items; the
measure itself is difficult for the average sixth-grade
respondent. Problem solving requires different cognitive thinking than that typically needed for many classroom
mathematics tasks (Lesh & Zawojewski, 2007;
Schoenfeld, 2011; Verschaffel et al., 1999). Students, on
average, correctly answered approximately 6 items on the
15-item measure. While this seems unusually low, the
average score is similar to those found in past investiga-
tions of middle-school students’ problem solving (Charles
& Lester, 1984; Verschaffel et al., 1999).
Furthermore, this should not be a surprise when considering an item with respect to Bloom's revised taxonomy (Krathwohl, 2002). Krathwohl suggests six hierarchical cognitive processes: remembering, understanding, applying, analyzing, evaluating, and creating. A problem-solving
task found on the PSM6 might be minimally classified as
an analysis-level task because respondents are asked to
break material into parts, decide how the parts relate to
one another and the overall item structure, and finally act
on the parts (Krathwohl, 2002). On the other hand, tradi-
tional mathematics achievement tasks do not necessarily
have to meet this threshold. Consider the following task
from a sixth-grade mathematics achievement test: “The
Tasty Soup Company uses 200 square inches of material to
make each 500-milliliter soup can. What does the 200
square inches represent?” (Ohio Department of Education,
2013). This item, which is a common format on standardized mathematics achievement tests, asks respondents to interpret the meaning of written communication, which meets Krathwohl's description of an understanding-level
task. Thus, tasks found on the PSM6 will likely meet or
exceed the cognitive complexity found on most traditional
standardized mathematics achievement measures. Hence,
scores on the PSM6 should be lower than scores on tradi-
tional standardized mathematics achievement measures.
Instrument Validation Importance
Having confidence that results from mathematics mea-
sures are consistently measuring (reliability evidence)
what we expect them to measure (validity evidence) is
critical in mathematics education research. However, it is
not appropriate to attempt meaningful judgments based on test results if the measures are not shown to be
empirically sound. In our validation study, multiple ana-
lytic methods (i.e., qualitative, statistical, and Rasch mea-
surement) allowed us to demonstrate that the PSM6 has
met minimum criteria in many areas of validity and reli-
ability evidence. To further advance the field of mathemat-
ics ability assessment, more instruments need to be
developed and empirically evaluated through the use of
more modern analytical approaches. Our description of
Rasch modeling and results from its use might assist
others to develop high-quality instruments with sufficient evidence for validity and high internal consistency.
Significance and Implications
The primary purpose of this study was to explore the
psychometric properties of the PSM6. Results from this
validation study indicate that the PSM6 met the criteria for
measuring sixth-grade students’ problem-solving abilities
within the context of the CCSSM. This study provides a
tool to study sixth-grade students’ problem-solving ability
on items addressing content described in the mathematics
standards adopted by numerous states. Moreover, the
PSM6 engages students in problems that encourage them to exhibit the mathematical behaviors and habits described in the SMPs. The items are characterized as being open, complex, and realistic problems. Problems like these are
needed to stimulate problem-solving and critical thinking
skills during classroom instruction and assessment.
Solving problems on the PSM6 was challenging for most
students, but not outside of their cognitive grasp.
With this in mind, the PSM6 is not a high-stakes
measure and we do not encourage its use in that fashion. It
does, however, provide useful information about sixth-
grade students’ problem-solving ability while also
addressing CCSSM found across all sixth-grade domains.
Researchers as well as school personnel may feel confi-
dent using this measure to gather data about students’
problem-solving abilities, which are a key feature of the
CCSSM. Additionally, the PSM6 is not a measure of general sixth-grade mathematics achievement, and we do
not advocate for its use in that manner either. It has been
shown to reliably assess students’ problem-solving abili-
ties within a specific content framework and there is suf-
ficient validity evidence for its use in this context only.
Limitations and Future Directions
One limitation of any problem-solving measure intend-
ing to meet the aim of being relevant to content standards
is that only 45 of the 50 states across the United States
have adopted the CCSSM. Thus, the mathematics found in
this measure is not addressed by standards universally
adopted across the United States. A second limitation is that the items are somewhat more challenging than the
students’ abilities (as can be seen in the variable map in
Figure 2); however, this should be expected on a problem-
solving measure compared with one consisting of rote
exercises because problem solving typically is more chal-
lenging for students to master compared with completing
exercises (Kilpatrick et al., 2001). While this minor
student ability and item difficulty misalignment exists, the
PSM6 has exceeded the minimum criteria for providing
reliable and valid information about students’ problem-
solving abilities.
One future area of research to explore related to the
PSM6 pertains to what teachers can do in the CCSSM era
to support students’ problem solving. Initial research
using the PSM6 suggests that providing students with
problems that address CCSSM content and practices
supports students' problem-solving performance (Bostic
et al., in press). A second future direction of study is to
develop similar measures for other grade levels as well as
another version that includes items from other grade
levels. This latter version might provide useful data that teachers could use to consider whether a student is ready to move on to the next grade-level mathematics course or how best to enrich individual students' mathematics education.
References
American Educational Research Association (AERA), American Psychologi-
cal Association, & National Council on Measurement in Education.
(2014). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
Boaler, J., & Staples, M. (2008). Creating mathematical futures through an equitable teaching approach: The case of Railside School. Teachers College Record, 110, 608–645.
Bode, R., & Wright, B. (1999). Rasch measurement in higher education. In
J. Smart & W. Tierney (Eds.), Higher education handbook of theory and
research (Vol. XIV, pp. 287–316). Netherlands: Springer.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd edn). Mahwah, NJ: Erlbaum.
Bostic, J. (2015). A blizzard of a value. Mathematics Teaching in the Middle School, 20, 350–357.
Bostic, J., Pape, S., & Jacobbe, T. (2011). Validating two problem-solving
instruments for use with sixth-grade students. In L. Wiest & T. Lamberg
(Eds.), Proceedings of the 33rd annual meeting of the North American
chapter of the international group for the psychology of mathematics
education (pp. 756–763). Reno, NV: University of Nevada – Reno.
Retrieved from http://www.pmena.org/html/proceedings.html
Bostic, J., Pape, S., & Jacobbe, T. (in press). Encouraging sixth-grade
students’ problem-solving performance by teaching through problem
solving. Investigations in Mathematics Learning. Retrieved from http://
scholarworks.bgsu.edu/teach_learn_pub/31/
Charles, R., & Lester, F. K. (1984). An evaluation of a process-oriented instructional program in mathematical problem solving in grades 5 and 7. Journal for Research in Mathematics Education, 15, 15–34.
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test
theory (2nd edn). Mason, OH: Wadsworth Publishing.
De Ayala, R. (2009). The theory and practice of item response theory. New
York: Guilford Press.
Duncan, P., Bode, R., Lai, S., & Perera, S. (2003). Rasch analysis of a new
stroke-specific outcome scale: The stroke impact scale. Archives of Physi-
cal Medicine and Rehabilitation, 84, 950–963.
Embretson, S., & Reise, S. (2000). Item response theory for psychologists.
Mahwah, NJ: Erlbaum.
Gall, M., Gall, J., & Borg, W. (2007). Educational research: An introduction
(8th edn). Boston: Pearson.
Glaser, B., & Strauss, A. (1967/2012). The discovery of grounded theory:
Strategies for qualitative research. Mill Valley, CA: Sociology Press.
Hatch, A. (2002). Doing qualitative research in education settings. Albany,
NY: State University of New York Press.
Kanold, T., & Larson, M. (2012). Common core mathematics in a PLC at
work: Leader’s guide. Bloomington, IN: Solution Tree Press.
Kilpatrick, J., Swafford, J., & Findell, B. (2001). Adding it up: Helping
children learn mathematics. Washington, DC: National Academy
Press.
Kincaid, J., Fishburne, R., Rogers, R., & Chissom, B. (1975). Derivation of
new readability formulas (Automated Readability Index, Fog Count and
Flesch Reading Ease Formula) for Navy enlisted personnel. Research
Branch Report 8–75. Millington, TN: Naval Technical Training, U.S. Naval
Air Station, Memphis, TN.
Koestler, C., Felton, M. D., Bieda, K. N., & Otten, S. (2013). Connecting the
NCTM process standards and the CCSSM practices. Reston, VA: National
Council of Teachers of Mathematics.
Krathwohl, D. (2002). A revision of Bloom's taxonomy: An overview. Theory Into Practice, 41(4), 212–218.
Lesh, R., & Zawojewski, J. (2007). Problem solving and modeling. In F. Lester
Jr. (Ed.), Second handbook of research on mathematics teaching and learn-
ing (pp. 763–804). Charlotte, NC: Information Age Publishing.
Linacre, J. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85–106.
Linacre, J. (2012). Winsteps (Version 3.74) [Computer Software]. Beaverton,
OR: Winsteps.com.
Matney, G., Jackson, J., & Bostic, J. (2013). Connecting instruction, minute
contextual experiences, and a realistic assessment of proportional reason-
ing. Investigations in Mathematics Learning, 6, 41–68.
Mayer, R., & Wittrock, M. (2006). Problem solving. In P. Alexander &
P. Winne (Eds.), Handbook of educational psychology (pp. 287–303).
Mahwah, NJ: Erlbaum.
National Center for Education Statistics. (2009). The nation’s report card:
Mathematics 2009 (NCES 2010-451). Retrieved from http://nces.ed.gov/
nationsreportcard/pubs/main2009/2010451.asp
NCTM. (1989). Curriculum and evaluation standards for school mathemat-
ics. Reston, VA: Author.
NCTM. (2000). Principles and standards for school mathematics. Reston, VA: Author.
NCTM. (2009). Focus in high school mathematics: Reasoning and sense
making . Reston, VA: Author.
NGA, & CCSSO. (2010). Common core state standards for mathematics. Retrieved
from http://www.corestandards.org/assets/CCSSI_Math%20Standards.pdf
Ohio Department of Education. (2013). Grade 6 – Released tests
materials. Retrieved from http://education.ohio.gov/Topics/Testing/
Testing-Materials/Released-Test-Materials-for-Ohio-s-Grade-3-8-Achie/
Grade-6-Released-Tests-Materials
Organization for Economic Development. (2010). PISA 2009 results: What students know and can do—Student performance in reading, mathematics and science (Vol. I). Retrieved from http://www.keepeek.com/Digital
-Asset-Management/oecd/education/pisa-2009-results-what-students
-know-and-can-do_9789264091450-en#page1
Palm, T. (2006). Word problems as simulations of real-world situations: A proposed framework. For the Learning of Mathematics, 26, 42–47.
Polya, G. (1945/2004). How to solve it. Princeton, NJ: Princeton University
Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attain-
ment tests. Copenhagen: Danmarks Paedagogiske Institut.
Schoenfeld, A. (2011). How we think: A theory of goal-oriented decision
making and its educational applications. New York: Routledge.
Smith, E., Conrad, K., Chang, K., & Piazza, J. (2002). An introduction to
Rasch measurement for scale development and person assessment. Journal
of Nursing Measurement, 10, 189–206.
Smith, R. (1996). A comparison of methods for determining dimensionality in
Rasch measurement. Structural Equation Modeling, 3, 25–40.
Strauss, A., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory. London: Sage.
Verschaffel, L., De Corte, E., Lasure, S., Van Vaerenbergh, G., Bogaerts, H.,
& Ratinckx, E. (1999). Learning to solve mathematical application prob-
lems: A design experiment with fifth graders. Mathematical Thinking and
Learning, 1, 195–229.
Waugh, R., & Chapman, E. (2005). An analysis of dimensionality using factor
analysis (true-score theory) and Rasch measurement: What is the differ-
ence? Which method is better? Journal of Applied Measurement, 6, 80–99.
Wright, B. D. (1996). Comparing Rasch measurement and factor analysis.
Structural Equation Modeling, 3, 3–24.
Authors’ Notes
Keywords: learning processes, problem solving, stu-
dents and learning, student assessment.
Correspondence concerning this article should be
addressed to Jonathan David Bostic, School of Teaching
and Learning, Bowling Green State University, 529 Edu-
cation Building, Bowling Green, OH, USA. E-mail:
A Research to Practice article based on this paper
can be found alongside the electronic version at http://
wileyonlinelibrary.com/journal/ssm.