Worcester Polytechnic Institute
Towards Assessing Students’ Fine Grained Knowledge: Using an Intelligent Tutor for Assessing
Mingyu Feng
August 18th, 2009
Ph.D. Dissertation Committee:
  Prof. Neil T. Heffernan (WPI)
  Prof. Carolina Ruiz (WPI)
  Prof. Joseph E. Beck (WPI)
  Prof. Kenneth R. Koedinger (CMU)
Motivation – the need
Concerns about poor student performance on new state tests
High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act
Student performance is not satisfactory
  Massachusetts (2003: 20% failed 10th grade math on the first try)
  Worcester
Secondary teachers are asked to be data-driven
  MCAS test reports
  Formative assessment and practice tests
    Provided by Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc.
Motivation – the problems
I: Formative assessment takes time from instruction
  NCLB or NCLU (No Child Left Untested)?
  Every hour spent assessing students is an hour lost from instruction
  Limited classroom time compels teachers to make a choice
Motivation – the problems
II: Performance reports are not satisfactory
  Teachers want more frequent and more detailed reports
Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action. At http://www.syrce.org/State_Board.htm
Main Contributions
Improved assessment system by taking into account how much assistance students need (WWW’06; ITS’06; EDM’08; UMUAI Journal’09 (nominated for James Chen award))
Established a way to track and predict performance longitudinally over multiple years (WWW’06; EDM’08)
Rigorously evaluated the effectiveness of the skill models of various granularities (AAAI’06 EDM Workshop; TICL’07; IEEE Journal’09)
Used a data mining approach to evaluate the effectiveness of individual contents (AIED’09)
Used data mining to refine existing skill models (EDM’09; in preparation)
Developed an online reporting system deployed and used by real teachers (AIED’05; Book chapter’07; TICL Journal’06; JILR Journal’07)
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
  Dynamic assessment
  Longitudinal modeling
  Cognitive diagnostic modeling
Conclusion & general implications
ASSISTments System
A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students’ progress
Teachers like ASSISTments
Students like ASSISTments
An ASSISTment
We break multi-step items (original questions) into scaffolding questions
Attempt: a student takes an action to answer a question
Response: the correctness of a student's answer (1/0)
Hint messages: given on demand; they hint at what step to do next
Buggy message: a context-sensitive feedback message
Skill: a piece of knowledge required to answer a question
Facts about ASSISTments
5000+ students have used the system regularly
More than 10 million data records collected
Other features
  Learning experiments; authoring tools; account and class management toolkit …
The dissertation uses data of about 1000 students who used ASSISTments during 2004-2006
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent Educational Machines within the Intelligent Systems Engineering Book Series . pp.23-49. Springer Berlin / Heidelberg.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
  Dynamic assessment
  Longitudinal modeling
  Cognitive diagnostic modeling
Conclusion & general implications
A Grade Book Report
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.
Where does this score come from?
Automated Assessment
Big idea: use data collected while a student uses ASSISTments to assess that student
  Lots of types of data available (last screen just used % correct on original questions)
  Lots of other possible measures
Why should we be more complicated?
A Grade Book Report
Static – does not distinguish “Tom” and “Jack”
Average – ignores development over time
Uninformative – not informative for classroom instruction
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic assessment
Dynamic Assessment – the idea
Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children’s learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.
Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983).
Dynamic vs. Static Assessment
Developing dynamic testing metrics
  # attempts
  # minutes to come up with an answer; # minutes to complete an ASSISTment
  # hint requests; # hint-before-attempt requests; # bottom-out hints
  % correct on scaffolds
  # problems solved
“Static” measure
  correct/wrong on original questions
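As an illustration of how such metrics could be derived from a tutor log, here is a minimal Python sketch. The `Action` record and its fields are hypothetical (the slides do not describe the ASSISTments log schema), and the rule that a question counts as correct only when the student's first action on it is a correct attempt is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One logged student action (hypothetical schema, for illustration only)."""
    question_id: str
    is_scaffold: bool       # True for scaffolding questions, False for original questions
    kind: str               # "attempt" or "hint"
    correct: bool           # only meaningful when kind == "attempt"
    bottom_out: bool = False  # True when a hint is the last (answer-revealing) hint


def dynamic_metrics(actions):
    """Compute a few dynamic metrics plus the static %-correct measure."""
    # Assumption: a question is "correct" only if the first action on it
    # is a correct attempt (asking for a hint first counts as incorrect).
    first_action = {}
    for a in actions:
        first_action.setdefault(a.question_id, a)

    def pct_correct(is_scaffold):
        firsts = [a for a in first_action.values() if a.is_scaffold == is_scaffold]
        if not firsts:
            return None
        right = sum(1 for a in firsts if a.kind == "attempt" and a.correct)
        return right / len(firsts)

    return {
        "attempts": sum(1 for a in actions if a.kind == "attempt"),
        "hint_requests": sum(1 for a in actions if a.kind == "hint"),
        "bottom_out_hints": sum(1 for a in actions if a.kind == "hint" and a.bottom_out),
        "scaffold_pct_correct": pct_correct(True),
        "original_pct_correct": pct_correct(False),  # the "static" measure
    }


# A made-up log: one original question answered wrongly, then two scaffolds.
log = [
    Action("q1", False, "attempt", False),
    Action("s1", True, "attempt", True),
    Action("s2", True, "hint", False),
    Action("s2", True, "hint", False, bottom_out=True),
    Action("s2", True, "attempt", True),
]
metrics = dynamic_metrics(log)
```

The static measure sees only that the original question was missed, while the dynamic metrics also record how much help the student needed on each scaffold.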
Dynamic Assessment – data
2004-2005 Data
  Sept, 2004 – May, 2005
  391 students
  Online data: 267 minutes (sd. = 79); 9 days; 147 items (sd. = 60)
  8th grade MCAS scores (May, 2005)
2005-2006 Data
  Sept, 2005 – May, 2006
  616 students
  Online data: 196 minutes (sd. = 76); 6 days; 88 items (sd. = 42)
  8th grade MCAS scores (May, 2006)
Dynamic Assessment - modeling
Three linear stepwise regression models, each predicting the MCAS score
  The standard test model: 1-parameter IRT proficiency estimate
  The assistance model: all online metrics
  The mixed model: 1-parameter IRT proficiency estimate + all online metrics
1-parameter IRT: one-parameter item response theory model
Dynamic Assessment - evaluation
Bayesian Information Criterion (BIC)
  Widely used model selection criterion
  Resolves overfitting problem by introducing a penalty term for the number of parameters
  Formula
  Prefer model with lower BIC
Mean Absolute Deviation (MAD)
  Cross-validated prediction error
  Function
  Prefer model with lower MAD
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
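The BIC formula itself did not survive in this transcript. One form given in Raftery (1995) for linear regression is BIC' = n ln(1 - R²) + p ln(n), where n is the number of cases and p the number of predictors. The sketch below assumes that form; whether the dissertation used exactly this expression is not stated here. As a plausibility check, it plugs in the 391 students of the 2004-2005 data and the squared correlation 0.733² of the standard test model, with one predictor (treating the squared correlation as R² is also an assumption).

```python
import math


def bic_raftery(n, r_squared, num_predictors):
    """BIC' for a linear regression, in the form n*ln(1 - R^2) + p*ln(n).

    Lower values indicate a better trade-off between fit and model size.
    """
    return n * math.log(1.0 - r_squared) + num_predictors * math.log(n)


# Assumed inputs: n = 391 students, R = 0.733, one predictor (the IRT estimate).
bic_standard = bic_raftery(n=391, r_squared=0.733 ** 2, num_predictors=1)
print(round(bic_standard))  # roughly -295
```

Under these assumptions the value lands close to the BIC reported for the standard test model in the results table, which suggests, but does not prove, that this is the form that was used.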
Dynamic Assessment - results
Model                     MAD    BIC    Correlation with 2005 8th grade MCAS
The standard test model   6.40   -295   0.733
The assistance model      5.46   -402   0.821
The mixed model           5.04   -450   0.841
Standard test model vs. assistance model: p = 0.001
Assistance model vs. mixed model: p = 0.001
Dynamic Assessment – what variables are important?
Dynamic Assessment - robustness
See if model can generalize
  Test model on other year’s data
Compare Models from Two Years
Which metrics are stable across years?
                            2004-2005 data   2005-2006 data
(Constant)                  32.414           3.284
IRT_Proficiency_Estimate    26.8             32.944
Scaffold_Percent_Correct    20.427           21.327
Avg_Question_Time           -0.17            -0.102
Avg_Attempt                 -10.5
Avg_Hint_Request            -3.217
Question_Count                               0.072
Avg_Item_Time                                0.045
Total_Attempt                                -0.044
Dynamic Assessment - conclusion
ASSISTments data enables us to assess more accurately
The relative success of the assistance model over the standard test model highlights the power of the dynamic measures
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online System that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal). 19(3), 2009.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
  Dynamic assessment
  Longitudinal modeling
  Cognitive diagnostic modeling
Conclusion & general implications
Can we have our cake and eat it, too?
Most large standardized tests are unidimensional or low-dimensional.
Yet, teachers need fine grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie, & Ciofalo, 2008; Stiggins, 2005)
Can we have our cake and eat it, too?
Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY.
Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. Retrieved from http://www.tcrecord.org/PrintContent.asp?ContentID=15363 on October 13, 2008.
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4), 324-328.
Cognitive Diagnostic Assessment
McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring.
Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis.
Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities?
McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E. and McCalla, G. I., (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pages 39-62. Springer-Verlag, Berlin.
Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).
Building Skill Models
WPI-1
  Math
WPI-5
  Patterns, Relations, and Algebra
  Geometry
  Measurement
  Number Sense and Operations
  Data Analysis, Statistics and Probability
WPI-39
  Using-measurement-formulas-and-techniques
  Setting-up-and-solving-equation
  Understanding-pattern
  Understanding-data-presentation-techniques
  Understanding-and-applying-congruence-and-similarity
  Converting-from-one-measure-to-another
  understanding-number-representations
  …
WPI-78
  Ordering-fractions
  Equation-solving
  Equation-concept
  Inducing-function
  Plot-graph
  XY-graph
  Congruence
  Similar-triangles
  Perimeter
  Area
  Circle-graph
  Unit-conversion
  Equivalent-Fractions-Decimals-Percents
  …
Cognitive Diagnostic Assessment – data
2004-2005 Data
  Sept, 2004 – May, 2005
  447 students
  Online data: 7.3 days; 87 items (sd. = 35)
  Item level response of 8th grade MCAS test (May, 2005)
2005-2006 Data
  Sept, 2005 – May, 2006
  474 students
  Online data: 5 days; 51 items (sd. = 24)
  Item level 8th grade MCAS scores (May, 2006)
All online and MCAS items have been tagged with all four skill models
Cognitive Diagnostic Assessment - modeling
Fit mixed-effects logistic regression model
Predict MCAS score
  Extrapolate the fitted model in time to the month of the MCAS test
  Obtain probability of getting each MCAS question correct, based upon skill tagging of the MCAS item
  Sum up probabilities to get total score
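The prediction steps above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the dissertation's implementation: the function names and all effect values are hypothetical, each MCAS item is assumed to be tagged with a single skill, and the way student-level and skill-level effects combine (added on the logit scale, with any group averages folded into them) is an assumption. The test month is taken to be May, i.e. month 8 when September is month 0.

```python
import math


def logistic(x):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_total_score(item_skills, skill_effects, student_effects, month):
    """Sum the predicted probabilities of answering each test item correctly.

    item_skills     -- one skill tag per test item (hypothetical tagging)
    skill_effects   -- {skill: (baseline, rate_of_change)} on the logit scale
    student_effects -- (baseline, rate_of_change) for one student
    month           -- elapsed months at test time (0 = September)
    """
    student_baseline, student_rate = student_effects
    total = 0.0
    for skill in item_skills:
        skill_baseline, skill_rate = skill_effects[skill]
        # Assumed combination: effects add on the logit scale.
        logit = (student_baseline + skill_baseline) + (student_rate + skill_rate) * month
        total += logistic(logit)
    return total


# Made-up effects for two skills and a three-item test.
skill_effects = {"Area": (0.2, 0.05), "Equation-solving": (-0.4, 0.10)}
mcas_items = ["Area", "Area", "Equation-solving"]
score = predicted_total_score(mcas_items, skill_effects, (0.0, 0.0), month=8)
```

Because each item contributes a probability between 0 and 1, the predicted score is a real number between 0 and the number of items, which can be compared directly with the student's actual score.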
Longitudinal model (e.g. Singer & Willett, 2003)
-- Xijkt is the 0/1 response of student i on question j tapping skill k in month t
-- Montht is elapsed month in the study; 0 for September, 1 for October, and so on
-- β0k and β1k: respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping skill k
-- β00 and β10: the group average incoming knowledge level and rate of change
-- β0 and β1: the baseline level of achievement and rate of change of the student
How do I Evaluate Models?
04-05 Data: real MCAS score and ASSISTment predicted score under each skill model
Student   Real MCAS score   WPI-1   WPI-5   WPI-39   WPI-78
Mary      25.00             23.31   22.85   22.18    20.47
Tom       32.00             29.66   29.15   28.67    27.13
…
Sue       29.00             28.46   28.23   27.85    26.26
Dick      28.00             27.41   26.70   26.12    24.30
Harry     22.00             23.33   22.58   22.02    20.14
Absolute difference between real and predicted score
Student   WPI-1    WPI-5    WPI-39   WPI-78
Mary      1.69     2.15     2.82     4.53
Tom       2.34     2.85     3.33     4.87
…
Sue       0.54     0.77     1.15     2.74
Dick      0.59     1.30     1.88     3.70
Harry     1.33     0.58     0.02     1.86
MAD       4.42     4.37     4.22     4.11
%Error    13.00%   12.85%   12.41%   12.09%
Paired two-sample t-test
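To make the MAD and %Error rows concrete, the following Python sketch recomputes the absolute differences for the five students shown, using the WPI-1 predictions. Two caveats: the reported MAD of 4.42 is over all students, so the five-student value here is only illustrative, and the full score of 34 is an assumption inferred from the reported ratio (4.42 / 34 = 13.00%), not a figure stated on the slide.

```python
# Real MCAS scores and WPI-1 predictions for the five students on the slide.
real = {"Mary": 25.00, "Tom": 32.00, "Sue": 29.00, "Dick": 28.00, "Harry": 22.00}
predicted_wpi1 = {"Mary": 23.31, "Tom": 29.66, "Sue": 28.46, "Dick": 27.41, "Harry": 23.33}

FULL_SCORE = 34  # assumption: inferred from 4.42 / 34 = 13.00%


def mean_absolute_deviation(real_scores, predicted_scores):
    """Average of |real - predicted| over all students."""
    diffs = [abs(real_scores[s] - predicted_scores[s]) for s in real_scores]
    return sum(diffs) / len(diffs)


def percent_error(mad, full_score):
    """MAD expressed as a percentage of the full test score."""
    return 100.0 * mad / full_score


mad_five = mean_absolute_deviation(real, predicted_wpi1)   # five students only
reported_pct = percent_error(4.42, FULL_SCORE)             # reported MAD for WPI-1
```

The per-student differences reproduce the WPI-1 column of the absolute-difference table (1.69, 2.34, 0.54, 0.59, 1.33).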
Comparing Models of Different Granularities
P = 0.21   P < 0.001   P = 0.006   P = 0.10
1-parameter IRT model: MAD 4.67 (13.70%); MAD 4.36 (12.83%)
04-05 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          4.42     4.37     4.22     4.11
%Error       13.00%   12.85%   12.41%   12.09%
05-06 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          6.58     6.51     4.83     4.99
%Error       19.37%   19.14%   15.10%   14.70%
P < 0.001   P < 0.001   P < 0.001   P = 0.03
The Effect of Scaffolding - hypothesis
Only using original questions makes it hard to decide which skill to “blame”
Scaffolding questions aid in diagnosis by directly assessing a single skill
Hypotheses
  Using responses to scaffolding questions will improve prediction accuracy
  Scaffolding questions are more useful for fine-grained models
The Effect of Scaffolding - results
04-05 Data (%Error)
Model    Only original questions used   Original + Scaffolding questions used
WPI-1    14.91%                         13.00%
WPI-5    14.06%                         12.85%
WPI-39   15.29%                         12.41%
WPI-78   17.75%                         12.09%
05-06 Data (%Error)
Model    Only original questions used   Original + Scaffolding questions used
WPI-1    20.05%                         19.37%
WPI-5    19.88%                         19.14%
WPI-39   18.68%                         15.10%
WPI-78   16.91%                         14.70%
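A quick way to read these results is to compute, for each skill model, how many percentage points of error are removed when scaffolding responses are added. The short Python sketch below does that arithmetic on the reported %Error figures; the variable and function names are illustrative only.

```python
# %Error as (only original questions, original + scaffolding questions).
percent_error = {
    "04-05": {"WPI-1": (14.91, 13.00), "WPI-5": (14.06, 12.85),
              "WPI-39": (15.29, 12.41), "WPI-78": (17.75, 12.09)},
    "05-06": {"WPI-1": (20.05, 19.37), "WPI-5": (19.88, 19.14),
              "WPI-39": (18.68, 15.10), "WPI-78": (16.91, 14.70)},
}


def scaffolding_gain(year):
    """Reduction in %Error (percentage points) from adding scaffolding responses."""
    return {model: round(only_original - with_scaffolding, 2)
            for model, (only_original, with_scaffolding) in percent_error[year].items()}


gains_0405 = scaffolding_gain("04-05")
gains_0506 = scaffolding_gain("05-06")
for year, gains in (("04-05", gains_0405), ("05-06", gains_0506)):
    print(year, gains)
```

In both years every model improves, and the two finer-grained models (WPI-39 and WPI-78) gain more than the two coarser ones, which is in line with both hypotheses.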
Cognitive Diagnostic Assessment - usage
Results presented in a nested structure of different granularities to serve a variety of stakeholders
Cognitive Diagnostic Assessment - conclusion
Fine-grained models do the best job estimating student skill level overall
  Not necessarily the best for all consumers (e.g. principals)
  Need the ability to diagnose (e.g. scaffolding questions)
Scaffolding questions
  Help improve overall prediction accuracy
  More useful for fine-grained models
Feng, M., Heffernan, N.T, Mani, M. & Heffernan C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds). Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. pp. 57-66.
Feng, M, Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)
Pardos, Z., Feng, M. & Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed effect methods. In Luckin & Koedinger (Eds.) Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press. pp. 626-628.
Future Work - Skill Model Refinement
We found that WPI-78 predicts a state test better than some less fine-grained models
However, WPI-78 may have some mis-taggings
  Expert-built models are subject to the risk of “expert blind spot”
  Our best guess in a 7-hour coding session
A best-guess model should be iteratively tested and refined
Skill Model Refinement - approaches
Human experts manually update hand-crafted models
  (1,000+ items) * (100+ skills)
  Not practical to do it often
Data mining can help
  Skills or items with high residuals
  Skills consistently over-predicted or under-predicted
  “Un-learned” skills (i.e. negative slopes from mixed-effects models)
Feng, M., Heffernan, N., Beck, J, & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.). Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.
Skill Model Refinement - approaches
Searching for better models automatically
  Learning Factor Analysis (LFA) (Koedinger, & Junker, 1999)
    A semi-automated method
    Three parts
      Difficulty factors associated with problems
      A combinatorial search space by applying operators (add, split, merge) on the base model
      A statistical model that evaluates how well a model fits the data
Can we increase the efficiency of LFA?
  Humans identify difficulty factors through task analysis
  Auto-methods search for better models based upon factors
Suggesting Difficulty Factors
Some items in a random sequence cause significantly less learning than others
Hypothesis
  Problems that “don’t help” students learn might be teaching a different skill(s)
Create factor tables
Preliminary results show some validity
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
Skill         Factor
Circle-area   High
Circle-area   High
Circle-area   High
Circle-area   Low
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
  Dynamic assessment
  Longitudinal modeling
  Cognitive diagnostic modeling
Conclusion & general implications
Conclusion of the Dissertation
The dissertation establishes novel assessment methods to better assess students in tutoring systems
Assess students better by analyzing their learning behaviors when using the tutor
Assess students longitudinally by tracking learning over time
Assess students diagnostically by modeling fine-grained skills
Comments from the Education Secretary
Secretary of Education, Arne Duncan weighed in (in Feb 2009) on the NCLB Act, and called for continuous assessment
Duncan says he is concerned about overtesting but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. "If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing," he says.
Ramírez, E., & Clark, K. (Feb., 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved on March 8th, 2009 from http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html.
General implication
Continuous assessment systems are possible to build (we built one)
Save classroom instruction time by assessing students during tutoring
Track individual progress and help stakeholders get student performance information
Provide teachers with fine-grained, cognitively diagnostic feedback to be “data-driven”
A metaphor for this shift
Committee on the Foundations of Assessment; Board on Testing and Assessment; Center for Education; National Research Council. James W. Pellegrino, Naomi Chudowsky, Robert Glaser (page 284).
Businesses don’t close down periodically to take inventory of stock any more
  Bar code; auto-checkout
  Non-stop business
  Richer information
Acknowledgement
My advisor
  Neil Heffernan
Committee members
  Ken Koedinger
  Carolina Ruiz
  Joe Beck
The ASSISTment team
My family
Many more…
Thanks!
Questions?
Backup slides
Motivation – the problems
III: The “moving” target problem
  Testing and instruction have been separate fields of research with their own goals
  Psychometric theory assumes a fixed target for measurement
  ITS wants student ability to “move”
More Contributions
Working systems
  www.ASSISTment.org
  The reporting system that gives cognitive diagnostic reports to teachers in a timely fashion
Establish an easy approach to detect the effectiveness of individual tutoring content
AIED’09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Grasser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). pp. 523-530. Amsterdam, Netherlands: IOS Press.
Evidence
62% 50% 37% 37%
Evidence
1. Congruence
2. Perimeter
3. Equation-Solving
Terminology
MCAS
Item/question/problem
Response
Original question
Scaffolding question
Hint message
Bottom-out hint
Buggy message
Attempt
Skill/knowledge component
Skill model/cognitive model/Q-matrix
Single mapping model
Multi-mapping model
The reporting system
I developed the first reporting system for ASSISTments in 2004; it is online, live, and gives detailed feedback at a grain size for guiding instruction
The grade book
“It’s spooky; he’s watching everything we do.” – a student
Identifying difficult steps
Informing hard skills
Linear Regression Model
An approach to modeling the relationship between one or more dependent variables (y) and one or more independent variables (X)
  Y depends linearly on X
How does linear regression work?
  Minimizing sum-of-squares
  Example of linear regression with one independent variable
Stepwise regression
  Forward; backward; combination
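For the one-variable case, minimizing the sum of squared residuals has a closed-form solution, sketched below in plain Python. The data points are made up for illustration, and the function names are not taken from the dissertation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); this minimizes the sum of squares
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
residual = sum_of_squares(xs, ys, intercept, slope)
```

Stepwise regression builds on the same fit: predictors are added (forward) or dropped (backward) one at a time according to a selection criterion, and the model is refitted after each change.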
1-Parameter IRT Model
Item response theory (IRT) model relates the probability of an examinee's response to a test item to an underlying ability in a logistic function
1-PL IRT model: $P(X_{ni} = 1) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}}$
where βn is the ability of person n and δi is the difficulty of item i.
I used BILOG-MG to run the model and get estimates of student ability and item difficulty
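The standard one-parameter (Rasch) form of this model gives the probability of a correct response as exp(βn − δi) / (1 + exp(βn − δi)). The Python sketch below evaluates that expression; the ability and difficulty values are made up, and estimating them from data (as BILOG-MG does) is outside the scope of this example.

```python
import math


def rasch_probability(ability, difficulty):
    """1-PL IRT: probability that a person answers an item correctly.

    ability    -- beta_n, the ability of person n
    difficulty -- delta_i, the difficulty of item i
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals difficulty the probability is exactly one half.
p_equal = rasch_probability(0.5, 0.5)
# A more able student has a higher chance on the same item.
p_strong = rasch_probability(1.5, 0.5)
p_weak = rasch_probability(-0.5, 0.5)
```

Only the difference between ability and difficulty matters, which is why the model needs just one parameter per item.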
Dynamic assessment - The models
Dynamic assessment - Validation
Longitudinal Modeling - data
Average %correct on original questions over time (FAKE data)
What does our real data look like?
[Figure: trajectories over time in panels labeled by student ID (239 through 810); x-axis: CenteredMonth (0 to 8); y-axis ticks from 0.00 to 54.00]
Longitudinal Modeling - methodology
What do we get from (linear) mixed effects models?
Average population trajectory for the specified group
  Trajectory indicated by two parameters
    intercept: $\beta_{00}$; slope: $\beta_{10}$
  The average estimated score for a group at time j is
    $\hat{Y}_{j} = \beta_{00} + \beta_{10} \cdot TIME_j$
One trajectory for every single student
  Each student got two parameters to vary from the group average
    intercept: $\beta_{00} + \beta_{0i}$; slope: $\beta_{10} + \beta_{1i}$
  The estimated score for student i at time j is
    $\hat{Y}_{ij} = (\beta_{00} + \beta_{0i}) + (\beta_{10} + \beta_{1i}) \cdot TIME_j$
Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford University Press, New York.
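A linear mixed-effects model of this kind describes the group by an average intercept and slope, and each student by deviations from those two values. The Python sketch below shows how the two kinds of trajectory are evaluated; the parameter names and all numeric values are hypothetical, and fitting the model to data is not shown.

```python
def group_score(group_intercept, group_slope, time):
    """Average estimated score for the group at a given time point."""
    return group_intercept + group_slope * time


def student_score(group_intercept, group_slope,
                  intercept_deviation, slope_deviation, time):
    """Estimated score for one student: group trajectory plus the
    student's own deviations in intercept and slope."""
    return ((group_intercept + intercept_deviation)
            + (group_slope + slope_deviation) * time)


# Made-up values: the group starts at 20 points and gains 1.5 points per month.
# This student starts 3 points lower but learns 0.5 points per month faster.
group_at_4 = group_score(20.0, 1.5, time=4)
student_at_4 = student_score(20.0, 1.5, -3.0, 0.5, time=4)
```

With both deviations set to zero, the student's trajectory coincides with the group trajectory, which is what makes the group parameters interpretable as the average student.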
Longitudinal Modeling - results
BIC: Bayesian Information Criterion (the lower, the better)
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.). Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin. pp. 31-40. 2006.
Mixed effects models
Individuals in the population are assumed to have their own subject-specific mean response trajectories over time
The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects)
It is possible to predict how individual response trajectories change over time
Flexibility in accommodating imbalance in longitudinal data
Methodological features: 1) 3 or more waves of data 2) an outcome variable (dependent variable) whose values change systematically over time 3) A sensible metric for time that is the fundamental predictor in the longitudinal study
Sample longitudinal data
Comparison of Approaches
Ayers & Junker (2006)
  Estimate student proficiency using
    1-PL IRT model
    LLTM (linear logistic test model)
      Main question difficulty decomposed into K skills
  1-PL IRT fits dramatically better
  Only main questions used
  Additive, non-temporal
  WinBUGS
Comparison of Approaches
Pardos et al. (2006)
  Conjunctive Bayes nets
  Non-temporal
  Scaffolding used
  Bayes Net Toolbox (Murphy, 2001)
DINA model (Anozie, 2006)
Comparison of Approaches
Feng, Heffernan, Mani & Heffernan (2006)
  Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM)
  Temporal
  Xijkt is the 0/1 response of student i on question j tapping KC k in month t
  Montht is elapsed month in the study; β0k and β1k are respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping KC k
  R lme4 library
Comparison of Approaches
Comparing to LLTM in Ayers & Junker (2006)
  Student proficiency depends on time
  Question difficulty depends on KC and time
  Assign only the most difficult skill instead of full Q-matrix mapping of multiple skills as in LLTM
  Scaffolding used to gain identifiability
  Ayers & Junker (2006) use regression to predict MCAS after obtaining estimate of student ability (θ) (MAD = 10.93%)
  No such regression process in my work
    logit(p=1) = θ – 0; estimated score = full score * p
  Higher MAD, but provides diagnostic information
Comparison of Approaches
Comparing to Bayes nets and conjunctive models
  Bayes: probability reasoning; conjunctive
  GLMM: linear learning; max-difficulty reduction
  Computationally much easier and faster
  Results are still comparable
    GLMM is better than Bayes nets when WPI-1, WPI-5 used
    GLMM is comparable with Bayes nets when WPI-39 or WPI-78 used
      WPI-39: GLMM 12.41%, Bayes: 12.05%
      WPI-78: GLMM 12.09%, Bayes: 13.75%
Cognitive Diagnostic Assessment – BIC results
BIC
Model                              WPI-1      WPI-5      WPI-39     WPI-78
04-05 Data                         173445.2   170359.9   170581.7   165711.4
  Difference (coarser − finer)                3085       -222       4870
05-06 Data                         39210.57   39174.29   54696.4    54299.54
  Difference (coarser − finer)                36         -15522     399
#data points are different
  Items tagged with more than one skill will be duplicated in the data
  Finer grained models have more multi-mappings, and thus, more data points (higher BIC)
WPI-5 better than WPI-1; WPI-78 better than WPI-39
Calculate MAD as the evaluation gauge
Analyzing Instructional Effectiveness
Detect relative instructional effectiveness among items in the same GLOP using learning decomposition.
$$\ln\left(\frac{P(correct)}{1 - P(correct)}\right) = a + Student + Item + B_1 t_1 + B_2 t_2 + B_3 t_3 + B_4 t_4$$
[Table: example data for student Tom; columns: Student, Item, prior encounters (t1, t2, t3, t4), Correct?]
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
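As a rough illustration of learning decomposition, the Python sketch below evaluates a logistic model in which the log-odds of a correct answer are an intercept plus student and item effects plus one weighted term per item for prior encounters. The additive form, the function name, and every coefficient value are assumptions made for this example; they are not taken from the dissertation.

```python
import math


def probability_correct(intercept, student_effect, item_effect,
                        item_weights, prior_encounters):
    """P(correct) under an assumed additive logistic model.

    item_weights     -- one coefficient per item in the group (B1..B4 here)
    prior_encounters -- 0/1 flags: has the student already seen each item?
    """
    log_odds = intercept + student_effect + item_effect
    log_odds += sum(b * t for b, t in zip(item_weights, prior_encounters))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients: item 4 contributes much less learning than items 1-3.
weights = [0.40, 0.35, 0.30, 0.05]

p_no_practice = probability_correct(0.0, 0.0, 0.0, weights, [0, 0, 0, 0])
p_after_item1 = probability_correct(0.0, 0.0, 0.0, weights, [1, 0, 0, 0])
p_after_item4 = probability_correct(0.0, 0.0, 0.0, weights, [0, 0, 0, 1])
```

Comparing the fitted weights is what reveals relative effectiveness: an item whose weight is much smaller than its neighbours' contributes little to later performance on the group.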
Searching Results
Among 38 GLOPs, LFA found significantly better models for 12
Shall I be happy?
  “Sanity” check: randomly assigned factor tables
#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)
Further work needs to be done
  Quantitatively measure whether and how data analysis results can be helpful for subject-matter experts
  Explore the automatic factor assigning approach on more data for other systems
  Contrast with human experts as controlled condition
Guess which item is the most difficult one?
Log likelihood                   -532.6         -524
Bayesian Information Criterion   1,079.2        1,065.99
Num of skills                    1              2
Num of parameters                2              4
Coefficients                     1.099, 0.137   1.841, 0.100; -0.927, 0.055
Item ID   Square-root   Factor-High
894       1             0
41        1             1
4673      1             1
117       1             1