JAG: Joint Assessment and Grading

Igor Labutov  IIL4@CORNELL.EDU

Cornell University, Ithaca, NY

Christoph Studer  STUDER@CORNELL.EDU

Cornell University, Ithaca, NY

Abstract

Test-taking (assessment) and grading are commonly treated as two separate processes and performed by two different groups of people. Recently, the boundaries between test-taking and grading have started to blur. Peer-grading, for example, leverages student knowledge and effort to grade other students' answers. Despite the considerable success of peer-grading in massive open online courses (MOOCs), test-taking and grading are still treated as two distinct tasks which typically occur at different times and require an additional overhead of grader training and incentivization. In order to overcome the drawbacks of peer-grading, we propose joint assessment and grading (JAG), a novel approach that fuses test-taking and grading into a single, streamlined process that appears to students in the form of an explicit test, but where everyone also acts as an implicit grader. We demonstrate the effectiveness and limits of JAG via simulations and a real-world user study.

1. Introduction

Multiple-choice questions (MCQs) are a common way to overcome the so-called scaling problem in assessment: grading a large number of submissions from many students in an efficient manner (Roediger III & Marsh, 2005). The immense scaling potential of MCQs to large classroom settings such as massive open online courses (MOOCs), however, comes at the cost of rigidity and considerable sensitivity to test design: questions and options that are designed without consideration of the students' likely misconceptions, for example, can yield uninformative questions (Haladyna, 1997; Rodriguez, 2005), e.g., when the distractors are easy to eliminate. As a consequence, the effort saved in grading is often offset by the effort required to design an effective MCQ test.

More recently, the practical realization of MOOCs opened another avenue for solving the problem of scaling in large-scale assessments, one that leverages the size of the classroom to its advantage: peer-grading. In its traditional form, peer-grading assigns to each student a secondary role of a grader. Students are responsible for validating the correctness of other students' solutions in order to assign a score, typically in accordance with a rubric provided by an instructor. The advantage of peer-grading is its flexibility to students' submissions, which may range from short answers to essays, diagrams, code, or entire projects (Kulkarni et al., 2015). Peer-grading, however, introduces several challenges that must be met to guarantee a successful deployment. First, each student differs in their ability to grade, and the grades assigned by different students must be reconciled in a reasonable way. Second, grading becomes an additional burden on the students, and mechanisms must be put in place that not only incentivize participation and effort, but also prevent students from "gaming" the process. Recent research in peer-grading has started to address these challenges (Piech et al., 2013; Raman & Joachims, 2014; Wu et al., 2015).

1.1. JAG: an alternative to Peer Grading

We present a novel alternative to peer-grading that naturally resolves the challenges of grade aggregation and incentivization. We propose joint assessment and grading (JAG), which fuses grading and assessment into a single, streamlined process by re-framing grading as additional testing. Our approach is motivated by the fact that a "grader" who has no answer key, when presented with a listing of other students' answers, is no different from a test-taker facing a multiple-choice question (with multiple possible correct or incorrect answers). In other words: a student selecting what they believe to be the correct answer in an MCQ constructed from the open-response submissions of other students is in effect simultaneously (i) grading the other students and (ii) being assessed on his or her ability to select the correct answer. In peer-grading, we already face the challenge of noisy inputs from the (potentially unmotivated) graders. By re-framing the act of grading as that of MCQ testing, the source of the apparent noise in grading becomes distributed according to the ability of the students in the class.

The proposed JAG mechanism combines the advantages of both worlds: the structure of multiple-choice questions and the flexibility to general response types offered by peer-grading. First, by constructing the MCQs directly from students' open-response submissions, the questions naturally capture the distribution of misconceptions present in the population of students being tested, requiring little to no instructor input. Second, our framework offers a mechanism for automatically grading open-response submissions, thus facilitating the greater student engagement and higher-order thinking characteristic of open-response questions (Haladyna, 1997). Third, by re-framing the task of grading as that of testing, the students are incentivized in the context of a familiar task: by expending their effort towards correctly answering an MCQ, they are implicitly directing that effort towards grading other students' submissions. At the same time, the students are not burdened with (what they may perceive as) a "thankless" job of grading; instead, in the process of answering the additional MCQs, they are given an additional opportunity to demonstrate their knowledge.

In this paper, we formalize the process of JAG as a statistical estimation problem. At the heart of our approach is the traditional Rasch model, which captures the interaction between student abilities and question difficulties in determining the likelihood of a student answering a question correctly (Rasch, 1993). We develop an expectation maximization (EM) algorithm for estimating the parameters of the proposed model in an unsupervised setting (i.e., in the absence of an answer key), and demonstrate the effectiveness of our framework through a real-world user study conducted on Amazon's Mechanical Turk. Additionally, we investigate the key properties and limitations of our approach via simulations.

2. Related Work

Our work builds on recent progress in two distinct areas, crowdsourcing and peer-grading, which we unite and extend within our proposed framework for joint assessment and grading (JAG).

2.1. Crowdsourcing

An important task in crowdsourcing, known as label aggregation, is concerned with the problem of optimally recovering some underlying ground truth (e.g., an image class label) from a number of (unreliable) human judgements; see (Hung et al., 2013) for a detailed review. In the context of education, the task of automatically identifying the correct answers from open-response submissions is closely related to label aggregation. Within the field of crowdsourcing, the works of (Dawid & Skene, 1979; Whitehill et al., 2009; Bachrach et al., 2012) are the most closely related to our approach. (Dawid & Skene, 1979) were the first to suggest an expectation maximization (EM) algorithm for label aggregation, motivated by the clinical setting of making a diagnosis. More recently, (Whitehill et al., 2009) extended this approach to model the variation in task difficulty in the context of image labeling. In the context of education, (Bachrach et al., 2012) proposed a statistical model for aggregating answers from "noisy" students, with the goal of automatically identifying the correct answers to MCQs. They deploy an EM algorithm for Bayesian inference and demonstrate the ability to infer correct answers accurately in the setting of an IQ test. Our work can be seen as a generalization of (Whitehill et al., 2009; Bachrach et al., 2012), where we explicitly model the dependence between question choices and the students who generate those choices in the context of answering open-response questions.

2.2. Peer-grading

Much of the recent research in peer-grading addresses the related problem of aggregating a number of "noisy" grades submitted by students in a statistically principled manner. Models such as the ones in (Piech et al., 2013; Raman & Joachims, 2014) pose the problem of peer-grading as one of statistical estimation. Since traditional grading assumes that graders are in possession of a grading rubric, statistical models of peer-grading are concerned primarily with accounting for the reliability and bias of graders in evaluating assignments against a gold standard. In contrast to such "explicit" models of grading, we view grading as an implicit process that arises as a by-product of students' genuine attempts to answer MCQs constructed from the open-response submissions of other students. As such, we do not require additional "grader-specific" parameters, as grading in our framework is subsumed by the response model (the model of how students answer questions as a function of their ability and question difficulty). We do note, however, that one of the proposed models in (Piech et al., 2013) explicitly couples grading and ability parameters in an attempt to capture the intuition that better students may also be better graders. This intuition can be viewed as being taken to its extreme in our setting: blurring the boundaries between grading and test-taking ensures that better students are more reliable graders by construction.

2.3. Clustering submissions

We also take note of an emerging area of work focused on clustering open-response submissions (not necessarily for peer-grading). An important by-product of scaling in the context of assessment is the inevitable increase in similarity between student open-response submissions, which results in redundancy during grading. Moreover, a recent theoretical result of (Shah et al., 2014) indicates that without some way of reducing the dimensionality of submissions, there will always be a constant number of misgraded assignments (assuming certain scaling properties of the classroom). In an effort to reduce the workload of instructors (or peers), there have been a number of successful attempts to cluster responses in specific domains, e.g., language (Basu et al., 2013; Brooks et al., 2014) and mathematics (Lan et al., 2015). Answer clustering is even more critical in the framework of JAG, where the practical constraints of testing limit the number of options that can be shown in a multiple-choice question. To generate effective questions, the presented options must offer a representative sample of the diverse open-response submissions in a large classroom. In Section 7, we demonstrate that the pattern of selected options alone provides the necessary signal to perform domain-agnostic clustering of submissions (i.e., without considering the content of the answers). This observation demonstrates the tremendous versatility of the framework to jointly assess, grade, and cluster open-response submissions.

3. Model

3.1. Fully observed setting

We start by reviewing the classic IRT Rasch model that will serve as the foundation of our approach. Consider a set of students S and a set of questions Q, where each student i ∈ S is endowed with an ability parameter s_i ∈ R, and each question j ∈ Q is endowed with a difficulty parameter q_j ∈ R (note that we capitalize all sets in our notation). By abuse of notation, we will often overload s_i to refer to both the student index i and their ability, depending on the context; the same applies to q_j, which we use to refer to the question itself as well as its difficulty. The well-established 1-PL IRT Rasch model (Rasch, 1993) expresses the probability that student s_i answers question q_j correctly via the following likelihood function:

P(z_{i,j} \mid s_i, q_j) = \frac{1}{1 + \exp(-z_{i,j}(s_i - q_j))},    (1)

where z_{i,j} ∈ {+1, −1} is the binary outcome of student s_i's attempt at question q_j; we use +1 and −1 to designate correct and incorrect responses, respectively. If we are in possession of an answer key for each question, then we also know {z_{i,j}}, ∀i, j (we will refer to this as the fully observed setting). This allows us to estimate the ability of each student and the difficulty of each question by maximizing the likelihood of all outcomes under our model:

\{s_i, \forall i,\; q_j, \forall j\} = \arg\max_{\{s_i\},\{q_j\}} \prod_{z_{i,j} \in D} P(z_{i,j} \mid s_i, q_j),    (2)

where D = {z_{i,j}} is the set of outcomes (e.g., of a test).
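In the fully observed setting, (2) can be solved with any gradient-based optimizer. A minimal sketch in Python, under an assumed triplet data layout and with illustrative names (a tiny ridge term is added only to keep the toy estimates finite):

```python
# Sketch: maximum-likelihood Rasch fit, (1)-(2), for the fully observed setting.
# Illustrative only; the data layout and names are assumptions, not the authors' code.
import numpy as np
from scipy.optimize import minimize

def fit_rasch(outcomes, n_students, n_questions):
    """outcomes: rows of (student index, question index, z in {+1, -1})."""
    i = outcomes[:, 0].astype(int)
    j = outcomes[:, 1].astype(int)
    z = outcomes[:, 2]

    def neg_log_lik(theta):
        s, q = theta[:n_students], theta[n_students:]
        margin = z * (s[i] - q[j])                  # z_ij * (s_i - q_j)
        nll = np.sum(np.logaddexp(0.0, -margin))    # -sum of log-sigmoid terms from (1)
        return nll + 1e-3 * np.sum(theta ** 2)      # tiny ridge keeps toy estimates finite

    theta0 = np.zeros(n_students + n_questions)
    res = minimize(neg_log_lik, theta0, method="L-BFGS-B")
    return res.x[:n_students], res.x[n_students:]

# toy usage: 3 students, 2 questions, every outcome observed
data = np.array([[0, 0, +1], [0, 1, +1],
                 [1, 0, +1], [1, 1, -1],
                 [2, 0, -1], [2, 1, -1]], dtype=float)
s_hat, q_hat = fit_rasch(data, n_students=3, n_questions=2)
```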

3.2. Partially observed setting

Consider now the setting where some (or all) of the outcomes z_{i,j} ∈ D are not observed. In practice, this is the case, for example, when the answer key to some (or all) of the questions is not available. In our setting, where the choices in the multiple-choice question are in fact other students' submissions, the correctness of these submissions is not known a priori. Let A_j be the set of open-response answers submitted by a subset of students S_open ⊆ S in response to question q_j. At some later time, a student s_i ∈ S_mcq ⊆ S is presented with the same question q_j, but in the form of a multiple-choice question, with the options being exactly the answers in A_j (note that S_mcq need not be disjoint from S_open). The student s_i is informed that there may be zero or more correct answers among the options in A_j and is instructed to select "all that apply." The student s_i goes through each option in A_j and submits a response to that option. Let y_{i,j} = {y^k_{i,j}} be the set of such responses made by student s_i on the set of answers A_j, where y^k_{i,j} ∈ {+1, −1} is student s_i's selection on the k-th answer (option) in A_j. In other words, the variables y^k_{i,j} are the observations of whether student s_i "clicked" on answer k to question j (i.e., whether that student judged that particular answer to be correct). In what follows, we describe the statistical model that relates the student and question parameters, which we are interested in estimating, to the set of response observations. Our model consists of two components: (i) the open-response component, which models the students (and their responses) that generate open-response answers, and (ii) the multiple-choice component, which models the students (and their responses) that are presented with the multiple-choice version of each question.

Open-response model: Because we do not know whether the submitted open-response answers are correct, we treat the correctness of each submission as a hidden variable z_{i,j} ∈ {+1, −1}; this allows us to express the component of the overall likelihood of our data that is responsible for the open-response answers only as follows:

P(\{z_{i,j}\} \mid S_{\text{open}}, Q) = \prod_{z_{i,j}} P(z_{i,j} \mid s_i, q_j),

where P(z_{i,j} | s_i, q_j) is the Rasch likelihood given in (1). Note that we drop the k-superscript notation for the z_{i,j} variables because each student is assumed to provide at most one open-response submission to each question (since k indexes the answers to a specific question). The observed responses to the multiple-choice version of each question (described next) provide the necessary data to estimate the parameters of the model, including the hidden variables z_{i,j}, i.e., the correctness of each open-response submission.

Multiple choice model: Now consider the setting where each question is presented in the form of an MCQ. Recall that the student answering the multiple-choice question is presented with multiple options, each generated by some (other) student in the set S_open, and where several options (or even no options) may be correct. The intuition that we want to capture in our model is that a student of great relative ability (i.e., s_i ≫ q_j) will select an option (y^k_{i,j} = +1), i.e., judge it as being correct, if that option is actually correct (z^k_j = +1). The same student will not select that option (y^k_{i,j} = −1) if that option is incorrect (z^k_j = −1). At the same time, a student of poor relative ability (i.e., s_i ≪ q_j) will not be able to identify the correct answer, regardless of whether the option is correct, i.e., they will guess. This intuition can be captured by the following function that parametrizes the likelihood of student s_i selecting option k of question q_j as correct:

P(y^k_{i,j} \mid s_i, q_j, z^k_j) = \frac{1}{2}\left(\frac{1}{1 + \exp(-y^k_{i,j} z^k_j (s_i - q_j))} + 1\right).    (3)

One can easily verify that this likelihood satisfies the requirements outlined above by considering every combination of assignments to y^k_{i,j} and z^k_j, and taking the limits s_i − q_j → ∞ (great relative ability) and q_j − s_i → ∞ (poor relative ability). Note that this time we drop, in z^k_j, the index of the student who generated option k of question q_j, as in the above we use s_i to refer to the student answering the multiple-choice version of the question. Note that the above likelihood follows the same intuition as that proposed by (Bachrach et al., 2012), but in a setting with an arbitrary number of choices and one correct answer. In Figure 1, we illustrate both components of the likelihood (the open-response and the multiple-choice component) as a graphical model. In this illustration, we use the notation s_{i'} to refer to the student who generated the answer and s_i to refer to the student who observes the answer as a choice in the multiple-choice version of question q_j.
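To make the behavior of (3) concrete, the following minimal sketch (illustrative names only) evaluates the selection likelihood at the two extremes of relative ability:

```python
# Sketch of the selection likelihood (3); illustrative only.
import numpy as np

def select_likelihood(y, z, s, q):
    """P(y | s, q, z) as in (3): y, z in {+1, -1}; s, q are ability/difficulty."""
    sigmoid = 1.0 / (1.0 + np.exp(-y * z * (s - q)))
    return 0.5 * (sigmoid + 1.0)

# A very able student (s >> q) selects a correct option almost surely ...
print(select_likelihood(y=+1, z=+1, s=10.0, q=0.0))   # ~1.0
# ... while a much weaker student's click carries little information (guessing regime).
print(select_likelihood(y=+1, z=+1, s=-10.0, q=0.0))  # ~0.5
```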

If we make the leap of assuming conditional independence between student s_i's responses to each option in a multiple-choice question (conditional on s_i, q_j, and z^k_j), then we can express the likelihood of observing every response to every multiple-choice question as follows:

P(\{Y_j\} \mid S_{\text{mcq}}, Q, \{Z_j\}) = \prod_{s_i \in S_{\text{mcq}}} \prod_{q_j \in Q} \prod_{\substack{y^k_{i,j} \in Y_j \\ z^k_j \in Z_j}} P(y^k_{i,j} \mid s_i, q_j, z^k_j).

The assumption of conditional independence requires some additional justification in our setting. Intuitively, we are justified in claiming conditional independence when we believe that the set of conditioning variables accounts for everything that may be shared across observations, such that the only remaining source of variance is noise. For example, observations of different students answering the same question on the test are conditionally independent given the difficulty of that question. In modeling the likelihood of a student selecting each option in a multiple-choice question, however, we overlook the potential for the options to be related. In an extreme example, two options may be identical or paraphrases of each other, which we expect to be commonplace when these options are generated by students in a large classroom. In this case, conditional independence no longer holds without the introduction of additional conditioning variables that group the related options in some way. To some extent, this problem can be mitigated by pre-processing and clustering similar answers before displaying them as options in a multiple-choice question; this is the strategy we take in this work. In Section 7, we discuss our ongoing work on automatically clustering answers based on the response patterns to multiple-choice questions.

To complete our model, we combine the open-response and the multiple-choice components:

P(\mathbf{y}, \mathbf{z} \mid \mathbf{s}, \mathbf{q}) = \underbrace{P(\mathbf{y} \mid \mathbf{z}, \mathbf{s}, \mathbf{q})}_{\text{multiple choice}} \; \underbrace{P(\mathbf{z} \mid \mathbf{s}, \mathbf{q})}_{\text{open response}},    (4)

where we adopt vector notation for the variables and parameters of our model, which will facilitate the development of the learning algorithm in Section 4. To give the dimensions of each of the variables in (4), assume that each student in S_open provides an open-response answer to each of the questions in Q and that each student in S_mcq also answers each question in Q (which entails providing a response to each option contained in a given question). Under these assumptions, z ∈ {+1, −1}^{|S_open||Q|}, y ∈ {+1, −1}^{|S_open||S_mcq||Q|}, s ∈ R^{|S_mcq ∪ S_open|}, and q ∈ R^{|Q|}.
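As a concrete illustration of these dimensions, one possible (assumed) array layout is:

```python
# Sketch of one possible array layout consistent with the dimensions in (4);
# the layout, sizes, and names are illustrative assumptions.
import numpy as np

n_open, n_mcq, n_q = 3, 4, 5                         # |S_open|, |S_mcq|, |Q|
rng = np.random.default_rng(0)

z = rng.choice([+1, -1], size=(n_open, n_q))         # hidden correctness of each submission
y = rng.choice([+1, -1], size=(n_mcq, n_q, n_open))  # click y^k_{i,j} per (MCQ student, question, option)
s = np.zeros(n_open + n_mcq)                         # abilities of all students (S_open then S_mcq)
q = np.zeros(n_q)                                    # question difficulties

assert z.size == n_open * n_q and y.size == n_open * n_mcq * n_q
```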

4. Parameter Learning

We now derive the expectation maximization (EM) algorithm for obtaining an approximate maximum likelihood estimate (MLE) of the parameters s and q of the model in (4). We briefly outline the key steps in obtaining the algorithm.

Figure 1. A probabilistic graphical model that summarizes the proposed joint assessment and grading (JAG) framework. Our model jointly captures the statistical dependencies between the abilities of students who generate open-response answers (S_open), the abilities of students who select the correct answers when they are presented in the form of a multiple-choice test (S_mcq), the underlying question difficulty q_j, and the observed responses (y^k_{i,j} ∈ {+1, −1}). The correctness of each open-response answer z^k_j ∈ {+1, −1} (omitting the index of the student who generated the answer) is a hidden variable, the state of which is inferred in addition to the remaining parameters during inference. Note that the model is able to "grade" the open-response submissions of students by integrating the response patterns of other students who were presented with a multiple-choice version of the question.

E-step: We compute the expectation of the log-likelihood (the logarithm of Equation 4) with respect to the unobserved variables z, which yields a function f(s, q) of the parameters s and q only. The expectation is taken with respect to the posterior distribution of z given a previous estimate of s and q (or an initial guess).

M-step: We obtain an updated estimate of the parameters s and q by maximizing the function f(s, q) obtained in the E-step.

The above procedure iterates until convergence. Below we give both steps explicitly in the context of the joint assessment and grading (JAG) framework.

E-step: Let s and q be an intermediate estimate of the parameters. Conditioning on these estimates, the posterior of z^k_j (the correctness of answer (option) k to question q_j) is a Bernoulli random variable with the probability of being correct given by (up to a normalizing constant):

P(z^k_j = 1 \mid \mathbf{s}, q_j) \propto \underbrace{P(z^k_j = 1 \mid s_{i'}, q_j)}_{\text{open response}} \; \underbrace{\prod_{s_i \in S_{\text{mcq}}} P(y^k_{i,j} \mid s_i, q_j, z^k_j = 1)}_{\text{multiple choice responses}}.    (5)

The posterior over the answer correctness z^k_j naturally integrates two sources of information: (i) the likelihood that the student who generated the answer was correct, and (ii) the likelihood that the students answering the multiple-choice version of the question "picked" this answer as correct (note that s_{i'} ∈ S_open and s_i ∈ S_mcq). Each likelihood is parametrized by the model's current estimate of the students' abilities and question difficulties, and as a consequence gives more weight to the signal coming from the more able students.
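A minimal sketch of how the posterior (5) can be evaluated for a single option, assuming helper functions that implement the likelihoods (1) and (3) (names are illustrative):

```python
# Sketch of the E-step posterior (5) for one option k of question j; illustrative only.
import numpy as np

def rasch_lik(z, s, q):
    # likelihood (1)
    return 1.0 / (1.0 + np.exp(-z * (s - q)))

def select_likelihood(y, z, s, q):
    # likelihood (3)
    return 0.5 * (1.0 / (1.0 + np.exp(-y * z * (s - q))) + 1.0)

def posterior_correct(s_author, q_j, s_mcq, y_k):
    """P(z^k_j = +1 | current s, q), given the clicks y_k observed on this option."""
    unnorm = {}
    for z in (+1, -1):
        p = rasch_lik(z, s_author, q_j)                       # open-response term
        p *= np.prod(select_likelihood(y_k, z, s_mcq, q_j))   # multiple-choice term
        unnorm[z] = p
    return unnorm[+1] / (unnorm[+1] + unnorm[-1])

# toy usage: an able author and three MCQ students of varying ability
print(posterior_correct(s_author=1.5, q_j=0.0,
                        s_mcq=np.array([2.0, 0.5, -1.0]),
                        y_k=np.array([+1, +1, -1])))
```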

M-step: The expectation of the log-likelihood with respect to z yields the following expression:

E_z[\log P(\mathbf{y}, \mathbf{z} \mid \mathbf{s}, \mathbf{q})] = f(\mathbf{s}, \mathbf{q})
  = \sum_{D_{\text{mcq}}} P(z^k_j = +1 \mid \mathbf{s}, q_j) \log \underbrace{P(y^k_{i,j} \mid s_i, q_j, z^k_j = +1)}_{R_1}
  + \sum_{D_{\text{mcq}}} P(z^k_j = -1 \mid \mathbf{s}, q_j) \log \underbrace{P(y^k_{i,j} \mid s_i, q_j, z^k_j = -1)}_{R_1}
  + \sum_{D_{\text{open}}} P(z^k_j = +1 \mid \mathbf{s}, q_j) \log \underbrace{P(z_{i',j} = +1 \mid s_{i'}, q_j)}_{R_2}
  + \sum_{D_{\text{open}}} P(z^k_j = -1 \mid \mathbf{s}, q_j) \log \underbrace{P(z_{i',j} = -1 \mid s_{i'}, q_j)}_{R_2},

where we introduce the shorthand D_open and D_mcq to refer to the sets of students, questions, and responses involved in (i) generating open-response submissions and (ii) providing multiple-choice responses, respectively. The above expression is a weighted linear combination of (log-) Rasch likelihoods (R1 and R2 are given in (3) and (1), respectively), and can be easily maximized with a small modification of an existing Rasch solver to account for the constants. We use the L-BFGS algorithm (Zhu et al., 1997) to perform this optimization step.
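A minimal sketch of one M-step under an assumed list-of-tuples data layout, where the E-step posteriors enter as fixed weights and the objective is maximized with SciPy's L-BFGS-B (names are illustrative):

```python
# Sketch of the M-step: maximize the weighted expected log-likelihood f(s, q).
# Illustrative only; D_mcq, D_open, and w are assumed data structures.
import numpy as np
from scipy.optimize import minimize

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)               # log of the Rasch likelihood (1)

def log_select(y, z, s, q):
    return np.log(0.5 * (1.0 / (1.0 + np.exp(-y * z * (s - q))) + 1.0))  # log of (3)

def neg_f(theta, n_students, D_mcq, D_open, w):
    """Negative of f(s, q); w[(j, k)] holds the E-step posterior P(z^k_j = +1)."""
    s, q = theta[:n_students], theta[n_students:]
    val = 0.0
    for (i, j, k, y) in D_mcq:                   # R1 terms, weighted by the posteriors
        val += w[(j, k)] * log_select(y, +1, s[i], q[j])
        val += (1.0 - w[(j, k)]) * log_select(y, -1, s[i], q[j])
    for (i_prime, j, k) in D_open:               # R2 terms
        val += w[(j, k)] * log_sigmoid(s[i_prime] - q[j])
        val += (1.0 - w[(j, k)]) * log_sigmoid(-(s[i_prime] - q[j]))
    return -val

# toy M-step: student 0 authored option 0 of question 0; student 1 clicked it
D_open = [(0, 0, 0)]
D_mcq = [(1, 0, 0, +1)]
w = {(0, 0): 0.9}                                # current posterior P(z^0_0 = +1)
theta0 = np.zeros(3)                             # [s_0, s_1, q_0]
res = minimize(neg_f, theta0, args=(2, D_mcq, D_open, w), method="L-BFGS-B")
```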

Initialization: Note that while the M-step is convex, the joint optimization problem in z, s, and q is not convex, and in general the EM algorithm will only yield an approximate solution and may get trapped in local optima. The problem becomes more pronounced in datasets with few interactions, e.g., small classrooms. As such, initialization plays an important role in determining the quality of the obtained solution. A natural heuristic for initializing the posteriors over z is the fraction of "votes" given to the answer (i.e., the fraction of students who identified the answer as correct). Observe that in computing the posterior over the answer correctness during the E-step (Equation 5), the posterior becomes exactly the fraction of "votes" given to the answer if we assume equal difficulty and ability parameters for all students, suggesting that initializing the posterior in this way provides a reasonable starting point for the algorithm. This heuristic is also suggested in (Dawid & Skene, 1979). We demonstrate the effectiveness of this heuristic in Section 6.
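A minimal sketch of this vote-fraction initialization, assuming the clicks are stored per question and per option (names are illustrative):

```python
# Sketch: initialize P(z^k_j = +1) with the fraction of "votes" an option received.
# Illustrative only; clicks[j][k] is an assumed list of y^k_{i,j} values in {+1, -1}.
import numpy as np

def init_posteriors(clicks):
    return {j: {k: float(np.mean(np.asarray(y) == +1)) for k, y in options.items()}
            for j, options in clicks.items()}

# toy usage: question 0 has two options; option 0 got 2/3 votes, option 1 got 1/3
clicks = {0: {0: [+1, +1, -1], 1: [-1, -1, +1]}}
print(init_posteriors(clicks))   # {0: {0: 0.666..., 1: 0.333...}}
```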

5. Experiments with Synthetic Data

In order to understand the behavior of our framework in a hypothetical classroom, we evaluate the model on a series of synthetically generated datasets. As our model attempts to infer the correctness of each answer entirely from the choices made by students answering multiple-choice questions, an important concern is the limitation of inference on difficult questions. Difficult questions are questions where we can expect the majority of students to be unable to identify the correct answers, and they present a challenge to any model that relies on aggregating judgements. The model's ability to recover the correct answer despite the majority being incorrect fundamentally requires the model to leverage its estimates of students' abilities so as to weigh the judgements of better students proportionally higher. Also note that we are concerned with questions of great relative difficulty (with respect to the ability of the students in the class), not absolute difficulty.

We can simulate an entire spectrum of regimes that present a varying degree of difficulty to inference, and evaluate the model's performance in correctly inferring the correct answers in each regime. We accomplish this by generating a synthetic population of students and questions with a fixed expected relative competency (i.e., E[s − q] = k, where s ∼ p(s) and q ∼ p(q)), performing inference with our model on the generated observations, and computing the fraction of correctly inferred correct answers (accuracy) for different E[s − q]. See Figure 2 for an illustration. Note that E[s − q] is a quantity that conveniently summarizes the classroom in terms of its "competency" relative to the testing material. Large values of E[s − q] indicate that the students are well prepared, and most will answer the questions correctly.

5.1. Simulation procedure

Figure 2. Distributions used in generating synthetic data, where p(q) and p(s) are the distributions of question difficulty and student ability, respectively. The quantity E[s − q] represents the average relative competency of the classroom: a large value of E[s − q] indicates that the majority of students will answer most of the test items correctly, and vice versa. See Figure 3 for the effect of the class distribution on performance.

We let p(s) = N(µ_s, σ = 2) and p(q) = N(µ_q, σ = 2). We generate a synthetic classroom with the parameters |S_open| = 10, |S_mcq| = 10, |Q| = 15, where every student in S_open submits an open-response answer to every question in Q, every student in S_mcq responds to every question (which entails providing a response to every option), and S_mcq ∩ S_open = ∅. We then sample the hidden (z) and observed (y) variables from Bernoulli distributions parametrized by (1) and (3), respectively.
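A minimal sketch of this generative procedure is given below; reading "Bernoulli distributions parametrized by (1) and (3)" as using (3) evaluated at y = +1 for the selection probability is our assumption, and the array names are illustrative:

```python
# Sketch of the synthetic-classroom generator; illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)
n_open, n_mcq, n_q = 10, 10, 15
mu_s, mu_q, sigma = 0.0, 0.0, 2.0          # shift mu_s - mu_q to sweep E[s - q]

s_open = rng.normal(mu_s, sigma, n_open)   # abilities of open-response students
s_mcq = rng.normal(mu_s, sigma, n_mcq)     # abilities of MCQ students
q = rng.normal(mu_q, sigma, n_q)           # question difficulties

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hidden correctness of each submission, sampled via the Rasch model (1)
p_correct = sigmoid(s_open[:, None] - q[None, :])              # shape (n_open, n_q)
z = np.where(rng.random((n_open, n_q)) < p_correct, +1, -1)

# observed clicks: y has shape (n_mcq, n_q, n_open); selection probability from (3) at y = +1
margin = s_mcq[:, None, None] - q[None, :, None]               # (n_mcq, n_q, 1)
p_select = 0.5 * (sigmoid(z.T[None, :, :] * margin) + 1.0)     # (n_mcq, n_q, n_open)
y = np.where(rng.random(p_select.shape) < p_select, +1, -1)
```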

Figure 3 illustrates the performance of the model as a function of the expected relative competency of the students (E[s − q]). We compare the performance of our model to a simple majority baseline (i.e., label the answer as correct if the majority of the students select it). As expected, the majority baseline works best when the relative competency of the class is high (since most students will correctly identify the correct answers). The performance degrades significantly in the regime where the relative competency is negative (i.e., most students are expected to answer the questions incorrectly). Observe that the model is able to maintain a significant performance margin (>10%) over the baseline even in the regime of low relative competency.

Figure 3. Accuracy in predicting the correct answers on synthetic data, as a function of the average relative competency E[s − q] in the classroom (measured in multiples of the standard deviations of the distributions). The simple majority-vote baseline performs comparably with our model for class distributions with large relative competency (since the majority of the students answer most questions correctly). The model significantly outperforms the baseline in the regime of lower relative competency (i.e., when most questions are too difficult for the majority of the students).

6. Real-World Experiments

We emulate a classroom setting on the Amazon Mechanical Turk platform by soliciting Mechanical Turk workers to participate in a reading comprehension task. The study was conducted in two separate phases with a different set of workers in each: (i) the open-response task and (ii) the multiple-choice task. In each task, a worker was presented with an article¹, followed by a set of 15 questions. In the open-response task, the questions were displayed in an open-response format, and the workers were asked to type in their responses. In the multiple-choice task, the same 15 questions were presented in a multiple-choice format, with the choices aggregated from the open-response submissions obtained in the open-response task. The answers collected in the open-response task were clustered semi-automatically before being displayed as choices in the multiple-choice task. The clustering step aggregated identical answers and answers within a few characters of each other (for example, due to spelling errors), and semantically identical answers (e.g., paraphrases) were then grouped manually. Clustering answers is a critical pre-processing step, as it ensures both that a reasonable number of choices is shown as part of the multiple-choice question and that the conditional independence assumption discussed in Section 3 holds. In Section 7, we outline our ongoing work on extending the model to automatically cluster open-response submissions based on the multiple-choice response signal alone.

¹Unit 7.2 (Language) from the OpenStax Psychology textbook.

In total, 15 workers participated in the open-response task and 82 workers participated in the multiple-choice task. A total of 225 open-response submissions were generated in response to the 15 comprehension questions, resulting in 101 distinct choices after clustering.

6.1. Results

We evaluate the effectiveness of our algorithm on the data collected via Amazon's Mechanical Turk using two performance metrics: (i) accuracy in predicting the correctness of each answer and (ii) the quality of the predicted ranking of the students. We evaluate our algorithm in a semi-supervised setting where we provide a set of partially labeled items, i.e., we label the correctness of a subset of the answers. This represents a practical use case of our framework: instead of being entirely hands-off, an instructor may choose to manually grade a subset of the students' answers to improve the performance of automatic inference. We evaluate two versions of our model, EM +open and EM -open, in addition to the majority baseline described in Section 5:

• EM +open: The full model as described in Section 3 and Section 4.

• EM -open: A restricted version of the EM +open model lacking the open-response component described in Section 3.2. In other words, during inference the model does not leverage any information about the ability of the answer generator, and relies entirely on the multiple-choice responses to infer the correctness of the answers.

6.1.1. PREDICTING ANSWER CORRECTNESS

Figure 5 depicts accuracy as a function of the amount of labeled data (accuracy was computed with respect to a gold-standard annotation of the correctness of each answer, performed by one of the authors of the paper). From it we conclude that (i) the full model (EM +open) significantly outperforms both the majority baseline and EM -open, (ii) EM +open performs very well without any labeled data (≈ 86% accuracy), (iii) adding labeled data improves performance, and (iv) the open-response component of the model (the one lacking in EM -open) is critical in significantly boosting performance, i.e., incorporating information about the answer creator is valuable for inferring the correctness of each answer.

Initialization: We also note that the initialization heuristic suggested in Section 4 is critical to achieving competitive performance in the regime of little to no labeled data (Figure 6). The performance of the model drops significantly below the majority baseline when a random initialization is used in place of the suggested heuristic.

6.1.2. PREDICTING STUDENT RANKING

Figure 4. Screenshot of a segment from each of the two Mechanical Turk tasks: (a) the open-response task and (b) the multiple-choice task. Workers are required to provide an open-response answer to each question in the open-response task, and to select (click) all answers that apply in the multiple-choice task. The choices in the multiple-choice task are aggregated from the open-response submissions of other workers in the open-response task.

Although predicting the correctness of each answer is itself a valuable intermediate output, a motivating use case of our framework is to assess the students' competency. A ranking of the students by their expertise is one example of summative assessment, and may be valuable in identifying students who excel or are in need of additional help. We evaluate the quality of the rankings produced by our model in the following way: (i) use the gold-standard annotation of the correctness of each answer to fit a standard Rasch model, identifying the abilities s_gold of each student (both in S_open and S_mcq), (ii) obtain the ability parameters using our model (EM +open and EM -open), trained with a varying amount of labeled data, and (iii) rank the students according to each set of parameters and compute the rank correlation. We use Kendall tau as the metric of rank correlation. Kendall tau returns a quantity in the range [−1, +1], where +1 indicates perfect correlation (every pair of students is in a consistent order in both rankings), −1 indicates that the rankings are inverted, and 0 indicates that the rankings are not correlated.
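The ranking comparison itself reduces to a single call to a standard Kendall tau routine; a minimal sketch with placeholder ability vectors:

```python
# Sketch: rank-correlation evaluation of inferred abilities against the gold-standard
# Rasch fit. Illustrative only; the ability vectors below are placeholders.
import numpy as np
from scipy.stats import kendalltau

s_gold = np.array([1.2, 0.3, -0.5, 2.1, -1.4])   # abilities from the gold-standard Rasch fit
s_model = np.array([1.0, 0.1, -0.2, 1.8, -1.0])  # abilities inferred by the model

tau, p_value = kendalltau(s_gold, s_model)
print(f"Kendall tau = {tau:.3f}")                # +1: identical ranking, -1: inverted
```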

Figure 7 and Figure 8 depict rank correlation as a function of the amount of labeled answers for the students in the sets S_open (workers in the open-response task) and S_mcq (workers in the multiple-choice task), respectively. We observe that (i) incorporating a partially labeled set of answers improves rank correlation, and (ii) the EM +open model performs on par with or better than the majority baseline (note that EM -open is not relevant when ranking the students in the S_open set).

6.1.3. EFFECT OF CLASSROOM SIZE

In a practical setting, it is important to consider the effect of classroom size on the quality of the inferred parameters. Intuitively, we expect that increasing the number of students answering multiple-choice questions, |S_mcq|, will improve performance (accuracy and rank correlation). Figure 9 depicts accuracy as a function of |S_mcq| (the number of students who answer multiple-choice questions), for two conditions based on the amount of partially labeled answers available. As expected, we observe that the performance of the model (EM +open) increases when more students participate in answering MCQs, and the gain becomes more pronounced with a greater number of labeled answers.

Figure 5. Accuracy in predicting the correct answers in the dataset collected on Mechanical Turk (accuracy on unlabeled answers, plotted against the fraction of labeled answers). The model that incorporates both the open-response and multiple-choice components (EM +open) significantly outperforms the model that only incorporates the multiple-choice component (EM -open) and a simple majority-vote baseline.

Figure 6. Accuracy in predicting the correct answers in the dataset collected on Mechanical Turk (accuracy on unlabeled answers, plotted against the fraction of labeled answers) for the model initialized with the heuristic described in Section 4 (EM +open) and the model initialized randomly (EM +open (rand init)). Good initialization significantly improves performance, especially in the regime of little to no partially labeled data.

Figure 7. Rank correlation (Kendall tau, plotted against the fraction of labeled answers) for students submitting open-response answers (S_open) between the model-inferred ranking (EM +open) and the ranking obtained using the gold-standard correctness labels for each answer (via the Rasch model). The model generates high-quality rankings with little to no labeled data, significantly outperforming the majority baseline (where students are ranked using the parameters obtained from the Rasch model, but the correctness of each answer is obtained via a majority vote).

Figure 8. Rank correlation (Kendall tau, plotted against the fraction of labeled answers) for students submitting multiple-choice answers (S_mcq) between the model-inferred ranking (EM +open and EM -open) and the ranking obtained using the gold-standard correctness labels for each answer (via the Rasch model). In contrast to the rank correlation for students submitting open-response answers (Figure 7), the rank correlation for multiple-choice students is lower.

7. Ongoing Work: Automatic Answer Clustering

As we noted in Section 3, in a realistic setting students are likely to submit redundant open-response answers, a significant consequence of which is the violation of the conditional independence assumption in the existing model. A principled way of identifying related answers and grouping those answers into a single choice would, as a byproduct, also make grading more efficient.

In practice, we expect the number of truly original answers to a question to be significantly smaller than the number of individual students who answer the question. Answers from different students, however, might vary in their phrasing or surface realization (e.g., simplified vs. unsimplified math expressions, short-answer paraphrases, equivalent diagrams drawn in different ways, etc.). It is therefore natural to think of answer creation from the perspective of a conditional mixture distribution over a smaller set of "latent answers," where for every answer, the student with a certain probability either contributes a variant of an existing answer or an original answer. The Chinese restaurant process (CRP) is one such conditional distribution over "latent answers," with the property that the number of latent answers (clusters) grows logarithmically with the number of observations. This is appealing in our setting: as the number of submitted answers increases, the number of original answers is also expected to grow, but at a diminishing rate (i.e., after many contributions, most submissions will be variants of existing answers rather than original answers).
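A minimal simulation sketch of the CRP illustrates this diminishing growth (the concentration parameter alpha is an arbitrary, illustrative choice):

```python
# Sketch: simulate a Chinese restaurant process and count the clusters ("latent answers").
# Illustrative only; alpha is an arbitrary concentration parameter.
import numpy as np

def crp(n_answers, alpha, rng):
    counts = []                                  # counts[c] = number of answers in cluster c
    assignments = []
    for n in range(n_answers):
        # new cluster with prob alpha / (n + alpha), else proportional to cluster size
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        c = rng.choice(len(probs), p=probs)
        if c == len(counts):
            counts.append(1)
        else:
            counts[c] += 1
        assignments.append(c)
    return assignments, counts

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    _, counts = crp(n, alpha=2.0, rng=rng)
    print(n, "answers ->", len(counts), "latent answers")   # grows roughly like alpha * log(n)
```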

Figure 9. Accuracy in predicting the correct answers in the dataset collected on Mechanical Turk as a function of the number of students answering multiple-choice questions (|S_mcq|), for (a) 10% and (b) 40% of labeled answers. More students answering multiple-choice questions improves the performance of the model (EM +open) relative to the majority baseline.

We are developing a model that is able to identify the latent clusters of answers based entirely on the pattern of multiple-choice responses ("clickthrough"), i.e., without considering the content of the answers (as done in (Basu et al., 2013; Brooks et al., 2014; Lan et al., 2015)). This makes the approach versatile in its application to any domain, e.g., clustering images, math, or language. The model is able to identify answer clusters by relying on the observation that certain choices tend to get selected together. We leave a detailed description and analysis of this model for future work. In Figure 10, we illustrate an example set of answers clustered using this model (experiments performed on Mechanical Turk).

8. Conclusion

In this work, we have developed a novel framework for joint assessment and grading (JAG), which, we believe, offers a powerful alternative to classical peer-grading. The advantage of the proposed JAG framework over traditional peer-grading is that it naturally fuses test-taking and grading into a unified, streamlined process. More importantly, JAG opens the door to a natural way to automatically cluster open-response submissions, a key challenge on the path to the long-standing goal of scaling assessment to large classrooms.

Acknowledgments

The work of I. Labutov was supported in part by a grant from the John Templeton Foundation provided through the Metaknowledge Network at the University of Chicago. The work of C. Studer was supported in part by Xilinx Inc. and by the US NSF under grants ECCS-1408006 and CCF-1535897.

Figure 10. Example of clusters learned from the patterns of multiple-choice answers alone. The correlation matrix on the left depicts the individual answers that the model believes belong to the same underlying "latent answer" (red = high probability, blue = low probability). Each answer is listed on the right, grouped according to the maximum a posteriori clustering configuration. Values on the right indicate whether the respective answer cluster is believed by the model to be a correct answer (green = correct (> 0), red = wrong (< 0)).

References

Bachrach, Yoram, Graepel, Thore, Minka, Tom, and Guiver, John. How to grade a test without knowing the answers: A Bayesian graphical model for adaptive crowdsourcing and aptitude testing. arXiv preprint arXiv:1206.6386, 2012.

Basu, Sumit, Jacobs, Chuck, and Vanderwende, Lucy. Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391-402, 2013.

Brooks, Michael, Basu, Sumit, Jacobs, Charles, and Vanderwende, Lucy. Divide and correct: Using clusters to grade short answers at scale. In Proceedings of the First ACM Conference on Learning @ Scale, pp. 89-98. ACM, 2014.

Dawid, Alexander Philip and Skene, Allan M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20-28, 1979.

Haladyna, Thomas M. Writing Test Items to Evaluate Higher Order Thinking. ERIC, 1997.

Hung, Nguyen Quoc Viet, Tam, Nguyen Thanh, Tran, Lam Ngoc, and Aberer, Karl. An evaluation of aggregation techniques in crowdsourcing. In Web Information Systems Engineering (WISE 2013), pp. 1-15. Springer, 2013.

Kulkarni, Chinmay, Wei, Koh Pang, Le, Huy, Chia, Daniel, Papadopoulos, Kathryn, Cheng, Justin, Koller, Daphne, and Klemmer, Scott R. Peer and self assessment in massive online classes. In Design Thinking Research, pp. 131-168. Springer, 2015.

Lan, Andrew S, Vats, Divyanshu, Waters, Andrew E, and Baraniuk, Richard G. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In Proceedings of the Second ACM Conference on Learning @ Scale, pp. 167-176. ACM, 2015.

Piech, Chris, Huang, Jonathan, Chen, Zhenghao, Do, Chuong, Ng, Andrew, and Koller, Daphne. Tuned models of peer assessment in MOOCs. arXiv preprint arXiv:1307.2579, 2013.

Raman, Karthik and Joachims, Thorsten. Methods for ordinal peer grading. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1037-1046. ACM, 2014.

Rasch, Georg. Probabilistic Models for Some Intelligence and Attainment Tests. ERIC, 1993.

Rodriguez, Michael C. Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2):3-13, 2005.

Roediger III, Henry L and Marsh, Elizabeth J. The positive and negative consequences of multiple-choice testing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(5):1155, 2005.

Shah, Nihar B, Bradley, Joseph, Balakrishnan, Sivaraman, Parekh, Abhay, Ramchandran, Kannan, and Wainwright, Martin J. Some scaling laws for MOOC assessments. In KDD Workshop on Data Mining for Educational Assessment and Feedback (ASSESS 2014), 2014.

Whitehill, Jacob, Wu, Ting-fan, Bergsma, Jacob, Movellan, Javier R, and Ruvolo, Paul L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pp. 2035-2043, 2009.

Wu, William, Daskalakis, Constantinos, Kaashoek, Nicolaas, Tzamos, Christos, and Weinberg, Matthew. Game theory based peer grading mechanisms for MOOCs. 2015.

Zhu, Ciyou, Byrd, Richard H, Lu, Peihuang, and Nocedal, Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550-560, 1997.