
Worcester Polytechnic Institute

Towards Assessing Students’ Fine Grained Knowledge: Using an Intelligent Tutor for Assessing


Mingyu Feng

August 18th, 2009

Ph.D. Dissertation Committee: Prof. Neil T. Heffernan (WPI); Prof. Carolina Ruiz (WPI); Prof. Joseph E. Beck (WPI); Prof. Kenneth R. Koedinger (CMU)

Motivation – the need

Concerns about poor student performance on new state tests

High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act

Student performance is not satisfactory: Massachusetts (2003: 20% failed 10th grade math on the first try); Worcester

Secondary teachers are asked to be data-driven: MCAS test reports; formative assessment and practice tests

Provided by Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc.

Motivation – the problems

I: Formative assessment takes time from instruction

NCLB or NCLU (No Child Left Untested)?

Every hour spent assessing students is an hour lost from instruction

Limited classroom time compels teachers to make a choice

Motivation – the problems

II: Performance reports are not satisfactory

Teachers want more frequent and more detailed reports

Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action. At http://www.syrce.org/State_Board.htm

Main Contributions

Improved assessment system by taking into account how much assistance students need (WWW’06; ITS’06; EDM’08; UMUAI Journal’09 (nominated for James Chen award))

Established a way to track and predict performance longitudinally over multiple years (WWW’06; EDM’08)

Rigorously evaluated the effectiveness of the skill models of various granularities (AAAI’06 EDM Workshop; TICL’07; IEEE Journal’09)

Used data mining approach to evaluate effectiveness of individual contents (AIED’09)

Used data mining to refine existing skill models (EDM’09; in preparation)

Developed an online reporting system deployed and used by real teachers (AIED’05; Book chapter’07; TICL Journal’06; JILR Journal’07)

Roadmap

Motivation

Contributions

Background - ASSISTments

Using tutoring system as an assessor: dynamic assessment; longitudinal modeling; cognitive diagnostic modeling

Conclusion & general implications

ASSISTments System

A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students’ progress

Teachers like ASSISTments

Students like ASSISTments

An ASSISTment

We break multi-step items (original questions) into scaffolding questions

Attempt: a student takes an action to answer a question

Response: the correctness of the student's answer (1/0)

Hint messages: given on demand; they hint at what step to do next

Buggy message: a context-sensitive feedback message

Skill: a piece of knowledge required to answer a question

Facts about ASSISTments

5000+ students have used the system regularly

More than 10 million data records collected

Other features: learning experiments; authoring tools; account and class management toolkit …

The dissertation uses data of about 1000 students who used ASSISTments during 2004-2006

AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K.R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.

Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent Educational Machines within the Intelligent Systems Engineering Book Series. pp. 23-49. Springer Berlin / Heidelberg.

A Grade Book Report

JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.

TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.

Where does this score come from?

Automated Assessment

Big idea: use data collected while a student uses ASSISTments to assess that student

Lots of types of data are available (the last screen used only % correct on original questions)

Lots of other possible measures

Why should we be more complicated?

A Grade Book Report

Static – does not distinguish “Tom” and “Jack”

Average – ignores development over time

Uninformative – not informative for classroom instruction

Dynamic assessment

Longitudinal modeling

Cognitive diagnostic assessment

Dynamic Assessment – the idea

Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children’s learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.

Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983).

Dynamic vs. Static Assessment

Developing dynamic testing metrics

# attempts

# minutes to come up with an answer; # minutes to complete an ASSISTment

# hint requests; # hint-before-attempt requests; # bottom-out hints

% correct on scaffolds

# problems solved

“Static” measure: correct/wrong on original questions
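The metrics listed on this slide could be aggregated from per-question logs roughly as follows. This is a minimal sketch, not the ASSISTments implementation: the `QuestionLog` record layout, the field names, and the choice to count original questions as "problems solved" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuestionLog:
    """One student's interaction with one question (hypothetical record layout)."""
    is_scaffold: bool          # False for an original question
    correct_first_try: bool    # the 1/0 response
    attempts: int
    hints: int
    hint_before_attempt: bool
    bottom_out_hint: bool
    seconds: float


def dynamic_metrics(logs):
    """Aggregate per-question logs into one static and several dynamic measures."""
    if not logs:
        raise ValueError("need at least one logged question")
    originals = [q for q in logs if not q.is_scaffold]
    scaffolds = [q for q in logs if q.is_scaffold]

    def pct_correct(questions):
        if not questions:
            return 0.0
        return sum(q.correct_first_try for q in questions) / len(questions)

    n = len(logs)
    return {
        "pct_correct_original": pct_correct(originals),   # the "static" measure
        "pct_correct_scaffold": pct_correct(scaffolds),
        "avg_attempts": sum(q.attempts for q in logs) / n,
        "avg_hints": sum(q.hints for q in logs) / n,
        "hint_before_attempt_count": sum(q.hint_before_attempt for q in logs),
        "bottom_out_hint_count": sum(q.bottom_out_hint for q in logs),
        "avg_minutes_per_question": sum(q.seconds for q in logs) / n / 60.0,
        "problems_solved": len(originals),  # assumption: counts original questions
    }


# Made-up example: two original questions and two scaffolding questions.
example_logs = [
    QuestionLog(False, False, 2, 1, False, False, 90.0),
    QuestionLog(True, True, 1, 0, False, False, 30.0),
    QuestionLog(True, False, 3, 2, True, True, 60.0),
    QuestionLog(False, True, 1, 0, False, False, 60.0),
]
example_metrics = dynamic_metrics(example_logs)
```

Only `pct_correct_original` would be available to a static test; every other entry depends on assistance or timing data.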

Dynamic Assessment – data

2004-2005 Data

Sept, 2004 – May, 2005

391 students

Online data: 267 minutes (sd. = 79); 9 days; 147 items (sd. = 60)

8th grade MCAS scores (May, 2005)

2005-2006 Data

Sept, 2005 – May, 2006

616 students

Online data: 196 minutes (sd. = 76); 6 days; 88 items (sd. = 42)

8th grade MCAS scores (May, 2006)

Dynamic Assessment - modeling

Three linear stepwise regression models, each predicting MCAS score

The standard test model: 1-parameter IRT proficiency estimate

The assistance model: all online metrics

The mixed model: 1-parameter IRT proficiency estimate + all online metrics

1-parameter IRT: one-parameter item response theory model

Dynamic Assessment - evaluation

Bayesian Information Criterion (BIC)

Widely used model selection criterion

Resolves overfitting problem by introducing a penalty term for the number of parameters

Formula

Prefer model with lower BIC

Mean Absolute Deviation (MAD)

Cross-validated prediction error

Function

Prefer model with lower MAD

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
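The two evaluation measures can be sketched in a few lines. The slide's own formulas did not survive in the transcript, so this sketch makes two assumptions: that BIC follows the regression form given in Raftery (1995), BIC' = n ln(1 − R²) + p ln(n), and that MAD is the plain mean of absolute prediction errors (the cross-validation loop is left out). The function names are illustrative.

```python
import math


def mean_absolute_deviation(actual, predicted):
    """Mean of |actual - predicted| over all students."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def raftery_bic(r_squared, n_cases, n_predictors):
    """BIC' for a linear regression model (assumed form, after Raftery 1995).

    The first term rewards fit; the second penalizes each extra predictor.
    Lower values indicate the preferred model.
    """
    return n_cases * math.log(1.0 - r_squared) + n_predictors * math.log(n_cases)


# Made-up numbers: a better-fitting model with one more predictor.
bic_small = raftery_bic(r_squared=0.5, n_cases=100, n_predictors=2)
bic_large = raftery_bic(r_squared=0.7, n_cases=100, n_predictors=3)
```

Under this form an extra predictor has to improve R² enough to outweigh ln(n), which is how the penalty term guards against overfitting.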

Dynamic Assessment - results

Model                     Predictors                                               MAD    BIC    Correlation with 2005 8th grade MCAS
The standard test model   1-parameter IRT proficiency estimate                     6.40   -295   0.733
The assistance model      All online metrics                                       5.46   -402   0.821
The mixed model           1-parameter IRT proficiency estimate + all online metrics 5.04  -450   0.841

Model comparisons: p=0.001

Dynamic Assessment – what variables are important?

Dynamic Assessment - robustness

See if model can generalize

Test model on other year’s data

Compare Models from Two Years

Which metrics are stable across years?

Variable                   2004-2005 data   2005-2006 data
(Constant)                 32.414           3.284
IRT_Proficiency_Estimate   26.8             32.944
Scaffold_Percent_Correct   20.427           21.327
Avg_Question_Time          -0.17            -0.102
Avg_Attempt                -10.5
Avg_Hint_Request           -3.217
Question_Count                              0.072
Avg_Item_Time                               0.045
Total_Attempt                               -0.044
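A fitted regression model is applied by multiplying each coefficient by the matching metric and adding the constant. The sketch below does this with the 2004-2005 column of the coefficient table. The student values are invented, and the units and scales of the metrics are assumptions, since the slide does not state them.

```python
# Coefficients from the 2004-2005 column of the table.
COEF_2004_2005 = {
    "(Constant)": 32.414,
    "IRT_Proficiency_Estimate": 26.8,
    "Scaffold_Percent_Correct": 20.427,
    "Avg_Question_Time": -0.17,
    "Avg_Attempt": -10.5,
    "Avg_Hint_Request": -3.217,
}


def predict_mcas(coefficients, features):
    """Linear prediction: constant plus the sum of coefficient * metric."""
    score = coefficients["(Constant)"]
    for name, weight in coefficients.items():
        if name != "(Constant)":
            score += weight * features[name]
    return score


# Hypothetical student; the scales of these values are assumed.
student = {
    "IRT_Proficiency_Estimate": 0.0,
    "Scaffold_Percent_Correct": 0.5,
    "Avg_Question_Time": 1.0,
    "Avg_Attempt": 1.5,
    "Avg_Hint_Request": 0.5,
}
predicted_score = predict_mcas(COEF_2004_2005, student)
```

The signs match the intuition behind the assistance metrics: more attempts and more hint requests lower the predicted score, while a higher scaffold percent correct raises it.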

Dynamic Assessment - conclusion

ASSISTments data enables us to assess more accurately

The relative success of the assistance model over the standard test model highlights the power of the dynamic measures

Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.

Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online System that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal). 19(3), 2009.

Can we have our cake and eat it, too?

Most large standardized tests are unidimensional or low-dimensional.

Yet, teachers need fine grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie, & Ciofalo, 2008; Stiggins, 2005)

Can we have our cake and eat it, too?

Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY.

Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. Retrieved from http://www.tcrecord.org/PrintContent.asp?ContentID=15363 on October 13, 2008.

Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4), 324-328.

Cognitive Diagnostic Assessment

McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring.

Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis

Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities?

McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E. and McCalla, G. I., (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pages 39-62. Springer-Verlag, Berlin.

Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).

Building Skill Models

WPI - 1: Math

WPI - 5: Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense and Operations; Data Analysis, Statistics and Probability; …

WPI - 39: Using-measurement-formulas-and-techniques; Setting-up-and-solving-equation; Understanding-pattern; Understanding-data-presentation-techniques; Understanding-and-applying-congruence-and-similarity; Converting-from-one-measure-to-another; understanding-number-representations; …

WPI - 78: Ordering-fractions; Equation-solving; Equation-concept; Inducing-function; Plot-graph; XY-graph; Congruence; Similar-triangles; Perimeter; Area; Circle-graph; Unit-conversion; Equivalent-Fractions-Decimals-Percents; …


Cognitive Diagnostic Assessment – data

2004-2005 Data

Sept, 2004 – May, 2005

447 students

Online data: 7.3 days; 87 items (sd. = 35)

Item level response of 8th grade MCAS test (May, 2005)

2005-2006 Data

Sept, 2005 – May, 2006

474 students

Online data: 5 days; 51 items (sd. = 24)

Item level 8th grade MCAS scores (May, 2006)

All online and MCAS items have been tagged with all four skill models

Cognitive Diagnostic Assessment - modeling

Fit mixed-effects logistic regression model

Longitudinal model (e.g. Singer & Willett, 2003)

-- Xijkt is the 0/1 response of student i on question j tapping skill k in month t

-- Montht is elapsed month in the study; 0 for September, 1 for October, and so on

-- β0k and β1k: respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping skill k

-- β00 and β10: the group average incoming knowledge level and rate of change

-- β0 and β1: the baseline level of achievement and rate of change of the student

Predict MCAS score

Extrapolate the fitted model in time to the month of the MCAS test

Obtain probability of getting each MCAS question correct, based upon skill tagging of the MCAS item

Sum up probabilities to get total score
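The prediction steps (extrapolate to the test month, get a probability per MCAS item from its skill tag, sum the probabilities) can be sketched as below. The model equation itself is not in the transcript, so the additive combination of group, student, and skill effects on the logit scale is an assumption based on the symbol definitions. All numeric effects are invented.

```python
import math


def logistic(x):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(month, group, student, skill):
    """Probability of a correct response in a given month.

    Each of group, student, skill is an (intercept, slope) pair on the logit
    scale; adding them together is an assumed model form.
    """
    intercept = group[0] + student[0] + skill[0]
    slope = group[1] + student[1] + skill[1]
    return logistic(intercept + slope * month)


def predicted_test_score(test_item_skills, month, group, student, skill_effects):
    """Sum of per-item probabilities, one item per skill tag in the list."""
    return sum(
        p_correct(month, group, student, skill_effects[skill])
        for skill in test_item_skills
    )


# Invented effects; month 8 is May when September is month 0.
GROUP = (0.0, 0.1)
STUDENT = (0.2, 0.0)
SKILLS = {"Area": (-0.2, 0.0), "Perimeter": (0.5, -0.1)}
score_in_may = predicted_test_score(
    ["Area", "Area", "Perimeter"], 8, GROUP, STUDENT, SKILLS
)
```

Because each test item contributes the probability for its tagged skill, a finer-grained skill model changes the predicted total only through how the items are tagged.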

How do I Evaluate Models?

04-05 Data: real MCAS score vs. ASSISTment predicted score, by skill model

Student   Real MCAS score   WPI-1    WPI-5    WPI-39   WPI-78
Mary      25.00             23.31    22.85    22.18    20.47
Tom       32.00             29.66    29.15    28.67    27.13
Sue       29.00             28.46    28.23    27.85    26.26
Dick      28.00             27.41    26.70    26.12    24.30
Harry     22.00             23.33    22.58    22.02    20.14
MAD                         4.42     4.37     4.22     4.11
%Error                      13.00%   12.85%   12.41%   12.09%

Absolute Difference

Student   WPI-1   WPI-5   WPI-39   WPI-78
Mary      1.69    2.15    2.82     4.53
Tom       2.34    2.85    3.33     4.87
Sue       0.54    0.77    1.15     2.74
Dick      0.59    1.30    1.88     3.70
Harry     1.33    0.58    0.02     1.86

Paired two-sample t-test
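The evaluation arithmetic on this slide can be reproduced directly from the five sample students. The sketch below recomputes the absolute differences and shows how %Error relates to MAD. Dividing by 34 reproduces the slide's %Error values, so the 34-point full score is an inference from the numbers rather than something the slide states. The slide's MAD row covers all students, not just these five.

```python
REAL = {"Mary": 25.00, "Tom": 32.00, "Sue": 29.00, "Dick": 28.00, "Harry": 22.00}

PREDICTED = {
    "WPI-1": {"Mary": 23.31, "Tom": 29.66, "Sue": 28.46, "Dick": 27.41, "Harry": 23.33},
    "WPI-5": {"Mary": 22.85, "Tom": 29.15, "Sue": 28.23, "Dick": 26.70, "Harry": 22.58},
    "WPI-39": {"Mary": 22.18, "Tom": 28.67, "Sue": 27.85, "Dick": 26.12, "Harry": 22.02},
    "WPI-78": {"Mary": 20.47, "Tom": 27.13, "Sue": 26.26, "Dick": 24.30, "Harry": 20.14},
}

FULL_SCORE = 34  # assumption: inferred because MAD / 34 matches the %Error row


def absolute_differences(model):
    """|real - predicted| per student, rounded to two decimals as on the slide."""
    return {s: round(abs(REAL[s] - PREDICTED[model][s]), 2) for s in REAL}


def sample_mad(model):
    """Mean absolute deviation over the five sample students only."""
    return sum(abs(REAL[s] - PREDICTED[model][s]) for s in REAL) / len(REAL)


def percent_error(mad_value):
    """MAD expressed as a percentage of the assumed full score."""
    return 100.0 * mad_value / FULL_SCORE
```

For example, `percent_error(4.42)` gives about 13.00 and `percent_error(4.11)` about 12.09, matching the %Error row for WPI-1 and WPI-78.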

Comparing Models of Different Granularities

04-05 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          4.42     4.37     4.22     4.11
%Error       13.00%   12.85%   12.41%   12.09%

p-values shown for the 04-05 comparisons: P = 0.21; P < 0.001; P = 0.006

05-06 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          6.58     6.51     4.83     4.99
%Error       19.37%   19.14%   15.10%   14.70%

p-values shown for the 05-06 comparisons: P < 0.001; P < 0.001; P < 0.001; P = 0.03

1-parameter IRT model: MAD 4.67 (13.70%); MAD 4.36 (12.83%); P = 0.10

The Effect of Scaffolding - hypothesis

Only using original questions makes it hard to decide which skill to “blame”

Scaffolding questions aid in diagnosis by directly assessing a single skill

Hypotheses

Using responses to scaffolding questions will improve prediction accuracy

Scaffolding questions are more useful for fine grained models

The Effect of Scaffolding - results

04-05 Data   Only original questions used   Original + Scaffolding questions used
WPI-1        14.91%                         13.00%
WPI-5        14.06%                         12.85%
WPI-39       15.29%                         12.41%
WPI-78       17.75%                         12.09%

05-06 Data   Only original questions used   Original + Scaffolding questions used
WPI-1        20.05%                         19.37%
WPI-5        19.88%                         19.14%
WPI-39       18.68%                         15.10%
WPI-78       16.91%                         14.70%
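A quick way to check the second hypothesis against these numbers is to subtract the two %Error columns per model. The sketch below does that. The dictionary layout and function name are illustrative, and the values are copied from the results on this slide.

```python
# (only original questions, original + scaffolding questions), both in %Error.
ERROR_PCT = {
    "04-05": {
        "WPI-1": (14.91, 13.00),
        "WPI-5": (14.06, 12.85),
        "WPI-39": (15.29, 12.41),
        "WPI-78": (17.75, 12.09),
    },
    "05-06": {
        "WPI-1": (20.05, 19.37),
        "WPI-5": (19.88, 19.14),
        "WPI-39": (18.68, 15.10),
        "WPI-78": (16.91, 14.70),
    },
}


def gain_from_scaffolding(year):
    """Reduction in %Error (percentage points) when scaffolding responses are added."""
    return {
        model: round(original_only - with_scaffolding, 2)
        for model, (original_only, with_scaffolding) in ERROR_PCT[year].items()
    }


gains_0405 = gain_from_scaffolding("04-05")
gains_0506 = gain_from_scaffolding("05-06")
```

In both years the two finer-grained models (WPI-39 and WPI-78) gain more from scaffolding responses than WPI-1 and WPI-5, although the largest gain falls on WPI-78 in 04-05 and on WPI-39 in 05-06.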

Cognitive Diagnostic Assessment - usage

Results presented in a nested structure of different granularities to serve a variety of stake-holders

Cognitive Diagnostic Assessment - conclusion

Fine-grained models do the best job estimating student skill level overall

Not necessarily the best for all consumers (e.g. principals)

Need the ability to diagnose (e.g. scaffolding questions)

Scaffolding questions

Help improve overall prediction accuracy

More useful for fine-grained models

Feng, M., Heffernan, N.T., Mani, M. & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds). Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. pp. 57-66.

Feng, M., Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)

Pardos, Z., Feng, M., Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed effect methods. In Luckin & Koedinger (Eds.) Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press. pp. 626-628.

Future Work - Skill Model Refinement

We found that WPI-78 is good enough to better predict a state test than some less fine-grained models

However, WPI-78 may have some mis-taggings

Expert-built models are subject to the risk of “expert blind spot”

Our best guess in a 7-hour coding session

A best-guess model should be iteratively tested and refined

Skill Model Refinement - approaches

Human experts manually update hand-crafted models

(1,000+ items) * (100+ skills)

Not practical to do it often

Data mining can help

Skills or items with high residuals

Skills consistently over-predicted or under-predicted

“Un-learned” skills (i.e. negative slopes from mixed-effects models)

Feng, M., Heffernan, N., Beck, J, & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.). Proceedings of the 1st International Conference on Education Data Mining. Montreal, 2008.

Skill Model Refinement - approaches

Searching for better models automatically

Learning Factor Analysis (LFA) (Koedinger & Junker, 1999)

A semi-automated method

Three parts

Difficulty factors associated with problems

A combinatorial search space built by applying operators (add, split, merge) on the base model

A statistical model that evaluates how well a model fits the data

Humans identify difficulty factors through task analysis

Auto-methods search for better models based upon factors

Can we increase the efficiency of LFA?
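To make the search operators concrete, here is a toy sketch of "split" and "merge" applied to a single-mapping skill model stored as an item-to-skill dictionary. It is not the LFA implementation: the data structure, function names, item IDs, and the `skill/level` naming scheme are all hypothetical, and the "add" operator and the statistical evaluation step are left out.

```python
def split_skill(model, skill, factor_by_item):
    """Split one skill into a skill per difficulty-factor level.

    Items tagged with `skill` that have a factor level are retagged as
    "<skill>/<level>"; every other item keeps its tag.
    """
    return {
        item: f"{tag}/{factor_by_item[item]}"
        if tag == skill and item in factor_by_item
        else tag
        for item, tag in model.items()
    }


def merge_skills(model, skill_a, skill_b, merged_name):
    """Merge two skills into one by retagging their items with `merged_name`."""
    return {
        item: merged_name if tag in (skill_a, skill_b) else tag
        for item, tag in model.items()
    }


# Hypothetical base model and difficulty-factor table.
base_model = {
    "item_a": "Circle-area",
    "item_b": "Circle-area",
    "item_c": "Perimeter",
    "item_d": "Area",
}
factor_table = {"item_a": "High", "item_b": "Low"}

split_model = split_skill(base_model, "Circle-area", factor_table)
merged_model = merge_skills(base_model, "Perimeter", "Area", "Perimeter-or-area")
```

Each operator yields a candidate model. A search would fit the statistical model to every candidate and keep the ones that fit the data better.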

Suggesting Difficulty Factors

Some items in a random sequence cause significantly less learning than others

Hypothesis: problems that “don’t help” students learn might be teaching a different skill(s)

Create factor tables

Preliminary results show some validity

Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.

Skill Factor

Circle-area High

Circle-area High

Circle-area High

Circle-area Low

Conclusion of the Dissertation

The dissertation establishes novel assessment methods to better assess students in tutoring systems

Assess students better by analyzing their learning behaviors when using the tutor

Assess students longitudinally by tracking learning over time

Assess students diagnostically by modeling fine-grained skills

Comments from the Education Secretary

Secretary of Education, Arne Duncan weighed in (in Feb 2009) on the NCLB Act, and called for continuous assessment

Duncan says he is concerned about overtesting but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. "If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing," he says.

Ramírez, E., & Clark, K. (Feb., 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved on March 8th, 2009 from http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html.

General implication

Continuous assessment systems are possible to build (we built one)

Save classroom instruction time by assessing students during tutoring

Track individual progress and help stakeholders get student performance information

Provide teachers with fine-grained, cognitively diagnostic feedback so that they can be “data-driven”

A metaphor for this shift

Committee on the Foundations of Assessment; Board on Testing and Assessment; Center for Education; National Research Council. James W. Pellegrino, Naomi Chudowsky, Robert Glaser (page 284).

Businesses don’t close down periodically to take inventory of stock any more

Bar code; auto-checkout

Non-stop business

Richer information

Acknowledgement

My advisor: Neil Heffernan

Committee members: Ken Koedinger; Carolina Ruiz; Joe Beck

The ASSISTment team

My family

Many more…

Thanks!

Questions?

Backup slides

Motivation – the problems

III: The “moving” target problem

Testing and instruction have been separate fields of research with their own goals

Psychometric theory assumes a fixed target for measurement

ITS wants student ability to “move”

More Contributions

Working systems

www.ASSISTment.org

The reporting system that gives cognitive diagnostic reports to teachers in a timely fashion

Establish an easy approach to detect the effectiveness of individual tutoring content


AIED’09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). pp. 523-530. Amsterdam, Netherlands: IOS Press.

Evidence

62% 50% 37% 37%

Evidence

1. Congruence

2. Perimeter

3. Equation-Solving

Terminology

MCAS

Item/question/problem

Response

Original question

Scaffolding question

Hint message

Bottom-out hint

Buggy message

Attempt

Skill/knowledge component

Skill model/cognitive model/Q-matrix

Single mapping model

Multi-mapping model

The reporting system

I developed the first reporting system for ASSISTments in 2004 that is online, live, and gives detailed feedback at a grain size for guiding instruction

The grade book

“It’s spooky; he’s watching everything we do”. – a student

Identifying difficult steps

Informing hard skills

Linear Regression Model

An approach to modeling the relationship between one or more dependent variables (y) and one or more independent variables (X)

Y depends linearly on X

How does linear regression work? Minimizing sum-of-squares

Example of linear regression with one independent variable

Stepwise regression: forward; backward; combination
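For the one-variable case, minimizing the sum of squared errors has a closed-form solution. The sketch below is illustrative only: the function name and the data are made up, and stepwise selection is not covered.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one independent variable."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that lie exactly on y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With more than one predictor the same criterion is minimized, but the solution is usually computed with matrix methods.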

1-Parameter IRT Model

Item response theory (IRT) model relates the probability of an examinee's response to a test item to an underlying ability in a logistic function

1-PL IRT model: $P(X_{ni} = 1) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}}$

where βn is the ability of person n and δi is the difficulty of item i.

I used BILOG-MG to run the model and get estimates of student ability and item difficulty
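The 1-PL (Rasch) response function is a logistic function of ability minus difficulty. The sketch below only evaluates that function. It does not estimate abilities or difficulties, which the dissertation did with dedicated software, and the function name is illustrative.

```python
import math


def rasch_probability(ability, difficulty):
    """P(correct) = exp(b - d) / (1 + exp(b - d)) for ability b and difficulty d."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student whose ability equals the item difficulty has an even chance.
p_equal = rasch_probability(0.5, 0.5)
# The same student on an easier item (lower difficulty) does better.
p_easier = rasch_probability(0.5, -0.5)
```

Only the difference between ability and difficulty matters, so shifting both by the same amount leaves the probability unchanged.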

Dynamic assessment - The models

Dynamic assessment - Validation

Longitudinal Modeling - data

Average %correct on original questions over time (FAKE data)

What does our real data look like?

[Figure: several panels plotting values from 0.00 to 54.00 (y-axis) against CenteredMonth from 0 to 8 (x-axis), one series per ID: 239, 240, 243, 244, 245, 246, 247, 248, 314, 315, 316, 320, 321, 327, 331, 666, 667, 668, 669, 805, 806, 807, 809, 810]

Longitudinal Modeling - methodology

What do we get from (linear) mixed effects models?

Average population trajectory for the specified group

Trajectory indicated by two parameters: intercept $\beta_{00}$; slope $\beta_{10}$

The average estimated score for a group at time j is $\beta_{00} + \beta_{10} \cdot TIME_j$

One trajectory for every single student

Each student gets two parameters to vary from the group average: intercept $\beta_{00} + \beta_{0i}$; slope $\beta_{10} + \beta_{1i}$

The estimated score for student i at time j is $(\beta_{00} + \beta_{0i}) + (\beta_{10} + \beta_{1i}) \cdot TIME_j$

Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford University Press, New York.
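A linear mixed-effects model of this kind gives a group trajectory and one trajectory per student. The sketch below evaluates both. The argument names (`b00`, `b10` for the group intercept and slope, `b0i`, `b1i` for a student's deviations from them) and all numbers are assumptions made for illustration.

```python
def group_score(time, b00, b10):
    """Average estimated score for the group at a given time."""
    return b00 + b10 * time


def student_score(time, b00, b10, b0i, b1i):
    """Estimated score for one student: group trajectory plus that student's deviations."""
    return (b00 + b0i) + (b10 + b1i) * time


# Invented values: the group starts at 20 and gains 1.5 points per month;
# this student starts 2 points lower but learns 0.25 points per month faster.
group_at_month_4 = group_score(4, b00=20.0, b10=1.5)
student_at_month_4 = student_score(4, b00=20.0, b10=1.5, b0i=-2.0, b1i=0.25)
```

A student whose deviations are both zero follows the group trajectory exactly.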

Longitudinal Modeling - results

BIC: Bayesian Information Criterion (the lower, the better)

Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.

Feng, M., Heffernan, N.T., Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.). Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin. pp. 31-40. 2006.

Mixed effects models

Individuals in the population are assumed to have their own subject-specific mean response trajectories over time

The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects)

It is possible to predict how individual response trajectories change over time

Flexibility in accommodating imbalance in longitudinal data

Methodological features: 1) 3 or more waves of data 2) an outcome variable (dependent variable) whose values change systematically over time 3) A sensible metric for time that is the fundamental predictor in the longitudinal study

Sample longitudinal data

Comparison of Approaches

Ayers & Junker (2006)

Estimate student proficiency using: 1-PL IRT model; LLTM (linear logistic test model), where main question difficulty is decomposed into K skills

1-PL IRT fits dramatically better

Only main questions used

Additive, non-temporal

WinBUGS

Comparison of Approaches

Pardos et al. (2006)

Conjunctive Bayes nets

Non-temporal

Scaffolding used

Bayes Net Toolbox (Murphy, 2001)

DINA model (Anozie, 2006)

Comparison of Approaches

Feng, Heffernan, Mani & Heffernan (2006)

Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM)

Temporal

Xijkt is the 0/1 response of student i on question j tapping KC k in month t

Montht is elapsed month in the study; β0k and β1k are respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping KC k.

R lme4 library

Comparison of Approaches

Comparing to LLTM in Ayers & Junker (2006)

Student proficiency depends on time

Question difficulty depends on KC and time

Assign only the most difficult skill instead of full Q-matrix mapping of multiple skills as in LLTM

Scaffolding used to gain identifiability

Ayers & Junker (2006) use regression to predict MCAS after obtaining estimate of student ability (θ) (MAD = 10.93%)

No such regression process in my work: logit(p=1) = θ – 0; estimated score = full score * p

Higher MAD, but provides diagnostic information

Comparison of Approaches

Comparing to Bayes nets and conjunctive models

Bayes: probability reasoning; conjunctive

GLMM: linear learning; max-difficulty reduction

Computationally much easier and faster

Results are still comparable

GLMM is better than Bayes nets when WPI-1, WPI-5 used

GLMM is comparable with Bayes nets when WPI-39 or WPI-78 used

WPI-39: GLMM 12.41%, Bayes: 12.05%

WPI-78: GLMM 12.09%, Bayes: 13.75%

Cognitive Diagnostic Assessment – BIC results

BIC

Model        WPI-1      WPI-5      WPI-39     WPI-78
04-05 Data   173445.2   170359.9   170581.7   165711.4
05-06 Data   39210.57   39174.29   54696.4    54299.54

BIC difference between adjacent models (coarser minus finer)

04-05 Data: 3085; -222; 4870

05-06 Data: 36; -15522; 399

# data points are different

Items tagged with more than one skill will be duplicated in the data

Finer grained models have more multi-mappings, and thus, more data points (higher BIC)

WPI-5 better than WPI-1; WPI-78 better than WPI-39

Calculate MAD as the evaluation gauge

Analyzing Instructional Effectiveness

Detect relative instructional effectiveness among items in the same GLOP using learning decomposition.

$\ln\left(\frac{P(correct)}{1 - P(correct)}\right) = a + Student + Item + B_1 \cdot t_1 + B_2 \cdot t_2 + B_3 \cdot t_3 + B_4 \cdot t_4$

Example data (prior encounters t1 to t4)

Student   Item   t1   t2   t3   t4   Correct?
Tom              0    0    0    0    1
Tom              1    0    0    0    0
Tom              1    0    1    0    0
Tom              1    1    1    0    1

Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.

Searching Results

Among 38 GLOPs, LFA found significantly better models for 12

Shall I be happy?

“Sanity” check: randomly assigned factor tables

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)

Further work needs to be done

Quantitatively measure whether and how data analysis results can be helpful for subject-matter experts

Explore the automatic factor assigning approach on more data for other systems

Contrast with human experts as controlled condition

Guess which item is the most difficult one?

Log likelihood                   -532.6         -524
Bayesian Information Criterion   1,079.2        1,065.99
Num of skills                    1              2
Num of parameters                2              4
Coefficients                     1.099, 0.137   1.841, 0.100; -0.927, 0.055

Item ID   Square-root   Factor-High
894       1             0
41        1             1
4673      1             1
117       1             1