Miguel Santamaría Lancho, Mauro Hernández, Ángeles Sánchez-Elvira, José María Luzón Encabo, Guillermo de Jorge-Botana, UNED, Spain

Using semantic technologies for giving a formative assessment and supporting scoring in large courses and MOOCs: first experiences at UNED (2015-2017)


Post on 21-Jan-2018



FACULTY OF ECONOMICS
Department of Economic History and Applied Economics
Economic History Teachers Team

FACULTY OF PSYCHOLOGY
Department of Developmental and Educational Psychology
G-Rubric software developers

Miguel Santamaría, José M. Luzón, Guillermo de Jorge, Mauro Hernández

Department of Personality
Ángeles Sánchez-Elvira
G-Rubric user

Our goal was to improve formative assessment in online courses by giving personalised feedback

Summary

1. Our challenge: how semantic technologies could help us to:
• give personalised feedback on open-ended questions
• support our tutors to score TMAs in a more reliable way

2. What G-Rubric is and how it works

3. Analysis of our experiences giving automatic formative feedback on open-ended questions

4. Proposal about how G-Rubric could cope with problems related to manual grading

5. Results and conclusions

How to give personalised feedback on open-ended activities

FEEDBACK IS THE KEY FACTOR FOR

• Personalising learning
• Fostering performance improvement
• Increasing motivation

01/11/2017 [email protected] 5

CHARACTERISTICS OF EXPECTED FEEDBACK

What kind of feedback do our students expect?

• Quick
• Iterative: they love learning by trial and error

Only technology can provide this kind of feedback

Feedback based on technologies offers limited solutions

In the classroom
• "Clickers" (Socrative, Kahoot)

In e-learning platforms
• Quizzes
• Adaptive quizzes

Quizzes have severe limitations for assessing learning outcomes in the field of economic history

Our challenge was how to give:
• quick and iterative feedback
• for open-ended questions
• in a sustainable way
• by using technologies

LEARNING OUTCOMES
• Knowledge about Economic History
• Soft skills: analysis, critical thinking

ASSESSMENT ACTIVITIES
• Multiple choice questions
• Open-ended short questions about concepts, historical processes, etc.
• Written commentaries on texts, maps, graphs and statistical data

WHAT G-RUBRIC IS AND HOW IT WORKS

To implement G-Rubric into a subject we need to follow 3 steps

1st step: To build up a specialized linguistic corpus and a Semantic Space
(Corpus: 6 Economic History textbooks)

2nd step: Activities based on short open-ended questions should be developed

3rd step: To deliver the activities to our students we use a web interface

(Diagram labels: Students, Web interface, In-built rubric space, Answer, Feedback, Canon answer)

An example to understand how G-Rubric works

Example of a G-Rubric open-ended activity

Question
Mercantilism: policies and objectives.

Canon answer (or golden text)
“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism). There are several types of policies, emphasizing: a) those focused on obtaining trade-balance surpluses (tariff protectionism, prohibition on exporting gold, silver or raw materials, privileged trading companies, navigation acts, colonial monopolies); b) promotion of manufactures (import tariffs or prohibitions, laws against luxury, royal manufactures); c) other policies: favoring the birth rate, limits or fixed rates on domestic prices. They are often associated with the names of Colbert in France, or the English or Dutch East India companies (VOC).”

Conceptual axes
• Definition: mercantilism, ideas, practices, state, economy, monarchy, strengthen, reinforce, increase, trade balance, favorable, bullionism, precious metals, gold, silver, privileges.
• Trade policies: trade, protectionism, tariffs, prohibition, exports, imports, privileged companies, navigation acts, colonies, monopoly, fleet, merchant, surplus.
• Manufacturing policies: manufactures, factories, royal, luxury, import substitution.
• Context: Europe, England, France, Holland, Colbert, XVI, XVII, XVIII, modern, VOC, East Indies, West Indies.
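The role of the conceptual axes can be illustrated with a very rough sketch. It assumes a plain keyword-overlap score, which is a simplification swapped in for illustration and not the semantic-space comparison that G-Rubric performs. The function `axis_coverage`, the scoring rule and the reduced term lists are hypothetical; the terms themselves are a subset of the axis descriptors in the mercantilism example.

```python
import re

# Subset of the axis descriptors from the mercantilism example (illustrative only).
AXES = {
    "Definition": {"mercantilism", "ideas", "state", "precious", "metals", "bullionism"},
    "Trade policies": {"trade", "protectionism", "tariffs", "exports", "monopoly", "colonies"},
    "Manufacturing policies": {"manufactures", "factories", "luxury"},
    "Context": {"europe", "france", "colbert", "voc", "modern"},
}

def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def axis_coverage(answer, axes=AXES):
    """Hypothetical score: share of each axis's descriptors found in the answer."""
    words = tokenize(answer)
    return {name: len(terms & words) / len(terms) for name, terms in axes.items()}

first_attempt = (
    "Mercantilism is a set of ideas and policies deployed in early modern Europe "
    "aimed at strengthening the State through economic power, focused on "
    "trade-balance surpluses and accumulation of precious metals (bullionism)."
)

for axis, score in axis_coverage(first_attempt).items():
    print(f"{axis}: {score:.2f}")
```

Under this toy measure a first attempt that only defines mercantilism scores fully on the Definition axis and not at all on Manufacturing policies, which is the kind of per-axis gap the graphical feedback is meant to expose.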

G-Rubric web interface

The student selects an activity

1. Mercantilism
2. Triangular Trade
3. Coal and the Industrial Revolution
4. Gerschenkron
5. Second Industrial Revolution
6. Consequences of the First World War
7. Bretton Woods

Selected activity: 1. Mercantilism

The student enters the answer

“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism).”

After submitting an answer, the student receives feedback consisting of:

• Content grade: to what extent the answer is correct
• Style grade: grammatical accuracy
• Graphical feedback: the answer's position on each conceptual axis (Definition, Trade, Manufactures, Context) relative to the acceptance area

After checking the feedback, the student improves their answer by adding new information

“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism). Among mercantilist policies, some stand out, i.e. those focused on attaining trade-balance surpluses through tariff protection, prohibition of exports of gold, silver and raw materials, creation of chartered trade companies, navigation acts and commercial monopolies”.

New feedback is provided

• The content grade increases
• The answer gets closer to the acceptance area on each conceptual axis
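Why adding relevant information raises the content grade can be sketched with a cosine similarity between the answer and the canon answer. This is only an approximation under stated assumptions: G-Rubric compares texts inside a semantic space built from the textbook corpus, whereas the sketch below uses raw word counts. The texts are abridged, hypothetical versions of the mercantilism example, and `term_vector` and `cosine` are illustrative names.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: word -> count (lower-cased, alphabetic tokens only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Abridged, hypothetical texts based on the mercantilism activity.
canon = term_vector(
    "Mercantilism aimed at strengthening the state through trade surpluses and "
    "precious metals, using tariffs, navigation acts, colonial monopolies and "
    "promotion of manufactures"
)
first = term_vector(
    "Mercantilism aimed at strengthening the state through trade surpluses and "
    "precious metals"
)
second = term_vector(
    "Mercantilism aimed at strengthening the state through trade surpluses and "
    "precious metals through tariffs, navigation acts and colonial monopolies"
)

print(f"first attempt:  {cosine(canon, first):.2f}")
print(f"second attempt: {cosine(canon, second):.2f}")
```

The second attempt shares more vocabulary with the canon answer, so its similarity is higher; a semantic space additionally credits related wording that is not a literal match.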

EXPERIENCES CARRIED OUT IN 2015 AND 2016
Providing personalized formative assessment

Experiences using G-Rubric in 2015 and 2016

• The trials carried out were focused on providing formative assessment

• Our goal was to promote deep learning through iterative feedback

• G-Rubric offers two main advantages regarding formative assessment:
  • It allows as many attempts as lecturers set
  • It gives the students immediate rich feedback

• All trials have been conducted with first year Business Administration Degree students

Two experiences (2015 and 2016): goals

OUR QUESTIONS
• Could G-Rubric give accurate feedback?
• Could the feedback lead to an improvement in subsequent answers?
• Could rich feedback increase the time devoted to the activity?

STUDENTS' OPINIONS ABOUT
• The impact on their motivation
• The usefulness for preparing the final exam
• The level of agreement with the grades received

2015: 132 volunteers
2016: 120 volunteers

The enriched graphical feedback increases:
• The number of trials performed by the students
• The amount of time devoted to the task

Content grade improvement

The average percentage score increases between first and last attempt

(Chart: average percentage score on the first and last attempt, Activities 1 to 7)

We verified that students could improve their answers by using the feedback

Students’ agreement with the grades received

The level of agreement was higher in the last trial

• First trial: 47% very much or totally agree
• Last trial: 70% very much or totally agree

G-Rubric had a positive impact on students’ motivation

Totally or very much: 65%

Totally or very much: 60%

Usefulness and positive value

• 80% of students considered G-Rubric totally or very much useful for exam preparation

• More than 80% of students considered this experience very much or totally positive

BEYOND FORMATIVE ASSESSMENT: HOW SEMANTIC TECHNOLOGIES CAN HELP TUTORS TO MARK TMAs

Are humans reliable when marking open-ended questions?

Manual grading has at least two problems:

• Inter-examiner variability, depending on who marked the task

• Intra-examiner reliability, depending on when the same tutor marked the task

Students view manual grading of open-ended questions as subjective
➢ In contrast, automated test assessment is perceived as more objective

Accidental double grading (2012 & 2013)
Two members of the academic team, independently and unknowingly, graded the same exams.

• The average differential was 1.5 points out of 8

• Final grades differed substantially: in 37.5% of cases the difference meant not obtaining a passing grade

Figure 5. Differential in grades for doubly-assessed exams (June 2012)*
(Chart series: "Essay + short questions grade differential" and "Essay grade differential", exams 1 to 24)

*Refers to 24 Economic History final exams from the Barcelona-CUXAM Regional Center (June 2012)

Accidental double grading (2013)

Figure 6. Differential in grades for doubly-assessed exams (June 2013)*
(Chart series: Essay + short questions differential, Grade 2 - Grade 1)

*Refers to 76 Economic History final exams from the Valencia-Alzira Regional Center (June 2013)

• We found the same pattern

• The average differential was 0.9 points out of 8

• Final grades differed substantially: in 21% of cases the difference meant not obtaining a passing grade
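The two figures reported for each year can be read as a mean absolute difference between examiners and a share of exams whose pass/fail outcome depends on who graded them. The sketch below follows that reading; the marks and the pass mark are hypothetical (the slides do not state a pass mark), and `grading_differential` is an illustrative name.

```python
from statistics import mean

def grading_differential(grades_1, grades_2, pass_mark):
    """Return (mean absolute difference, share of exams whose pass/fail outcome flips)."""
    pairs = list(zip(grades_1, grades_2))
    diffs = [abs(b - a) for a, b in pairs]
    flips = [(a >= pass_mark) != (b >= pass_mark) for a, b in pairs]
    return mean(diffs), sum(flips) / len(flips)

# Hypothetical marks out of 8 given by two examiners to the same four exams.
examiner_1 = [3.0, 5.5, 4.5, 6.0]
examiner_2 = [4.5, 5.0, 3.5, 7.0]

# pass_mark=4.0 is an assumption made only for this illustration.
avg_diff, flip_share = grading_differential(examiner_1, examiner_2, pass_mark=4.0)
print(f"average differential: {avg_diff} points, outcome changes: {flip_share:.0%}")
```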

Correlations between grades assigned by examiners

                   2012    2013
n                  20      76
Global grade       0.82    0.88
Short questions    0.85    0.87
Text commentary    0.70    0.67

Despite these differences between examiners we found:
• A high correlation on the global score and short questions
• Lower correlation on text commentary grades
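For readers who want to reproduce this kind of figure, a Pearson correlation between two examiners' marks can be computed as in the minimal sketch below. The marks are hypothetical and the function name `pearson` is illustrative; the original exam data are not part of these slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of marks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical marks given by two examiners to the same five exams.
examiner_1 = [3.0, 5.5, 4.5, 6.0, 7.0]
examiner_2 = [4.5, 5.0, 3.5, 7.0, 6.5]

print(f"r = {pearson(examiner_1, examiner_2):.2f}")
```

A high correlation means the examiners rank the exams similarly; it does not rule out a systematic gap in the marks themselves, which is why the average differential is reported separately.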

Comparing how tutors and G-Rubric mark TMAs

Our first step has been to compare how tutors and G-Rubric grade TMAs

• G-Rubric could cope simultaneously with both problems:
  • Inter-examiner variability
  • Intra-examiner reliability

• A fragment of "The Wealth of Nations", by Adam Smith, was selected for students to comment on.

• A rubric was built to minimise inter-examiner variability.

• A G-Rubric object, similar to those described above, was designed, and its axes were aligned with the rubric used by tutors to mark the students' assignments.

• The tutors graded these assignments using the rubric.

• The teaching team used G-Rubric to grade the students' TMAs again.

• 252 TMAs were double-graded to compare G-Rubric's and tutors' marks.

What have we found comparing grades given by tutors and G-Rubric?

1. No significant difference between means.

Main descriptives of tutors' and G-Rubric marks (N=252)

                  M      SD     Min    Max
Tutors' marks     5.95   1.45   1.55   8.54
G-Rubric marks    5.92   1.61   2.13   9.20

An independent samples t test yielded no significant differences between the means of tutors' and G-Rubric marks, t(251), p=.720, ns.

2. The Pearson correlation between G-Rubric's and tutors' marks yielded a large effect size (.549**).

** The correlation is significant at the 0.01 level (two-tailed).
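The degrees of freedom reported above, 251, equal N - 1 for the 252 double-graded TMAs, which is what a comparison of paired marks gives. A minimal sketch of such a paired t statistic follows, with hypothetical marks; which procedure the authors' statistics package ran is not stated beyond the slide text, and `paired_t` is an illustrative name.

```python
import math
from statistics import mean, stdev

def paired_t(marks_a, marks_b):
    """Paired-samples t statistic and its degrees of freedom (n - 1)."""
    diffs = [a - b for a, b in zip(marks_a, marks_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical marks for the same five assignments.
tutor_marks = [6.0, 5.0, 7.0, 4.0, 6.5]
g_rubric_marks = [5.5, 5.5, 6.5, 4.5, 6.5]

t, df = paired_t(tutor_marks, g_rubric_marks)
print(f"t({df}) = {t:.2f}")
```

A t value close to zero, as in this toy data, corresponds to means that do not differ; individual marks can still disagree, which is what the correlation and the homogeneity analysis address.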

Grades distributions: analysis of frequencies

(Bar chart: percentage of grades in each points range, from "0 to 1" to "9 to 10", for tutors' grades and G-Rubric's grades. Data labels, tutors' grades: 0.79, 4.37, 6.75, 6.35, 30.56, 28.57, 14.68, 7.94. Data labels, G-Rubric's grades: 4.76, 9.92, 15.08, 17.06, 22.62, 21.83, 7.94, 0.79.)

3. G-Rubric's marks were more homogeneously distributed, in comparison with the higher concentration of the tutors' marks in the ranges between 5 and 7 points

Analysis of the homogeneity of G-Rubric and tutors' marks

              Tutor mark   G-Rubric mark   Mark difference
Chi-square    69.14        47.21           74.49
df            36           36              36
p             .001         .100            .000

Kruskal-Wallis analyses for the evaluation of marks homogeneity between the 37 tutoring groups

4. Tutors' marks presented significant inter-group variability, as did the mark difference.

In contrast, G-Rubric marks did not differ significantly between these same tutorial groups, which indicates a higher level of homogeneity.
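The Kruskal-Wallis statistic ranks all marks together and checks whether some groups hold systematically higher ranks than others; with 37 groups it has 37 - 1 = 36 degrees of freedom, as in the table. The following is a minimal pure-Python sketch with a standard tie correction, using hypothetical marks; it is not the authors' analysis script, and `kruskal_wallis_h` is an illustrative name.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples, correcting for ties."""
    values = sorted(v for group in groups for v in group)
    n = len(values)
    ranks = {}      # value -> average rank among all pooled observations
    tie_term = 0    # sum of t**3 - t over runs of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        ranks[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # every observation identical: no variability to test
        return 0.0, len(groups) - 1
    return h / correction, len(groups) - 1

# Hypothetical marks from three tutoring groups.
group_marks = [[5.0, 5.5, 6.0], [6.5, 7.0, 7.5], [4.0, 4.5, 8.0]]
h, df = kruskal_wallis_h(group_marks)
print(f"H = {h:.2f}, df = {df}")
```

Under the null hypothesis of homogeneous groups, H approximately follows a chi-square distribution with that many degrees of freedom, which is why the table labels the statistic as chi-square.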

CONCLUSIONS

Main conclusions

• Automated-assessment software such as G-Rubric is currently mature enough to be used with students.

• The kind of feedback offered was useful to improve the students’ performance

• Results in terms of students’ satisfaction are also encouraging.

• For teachers, the time and effort required is affordable.

• A substantial correlation (.549) between tutors' and G-Rubric's marks, and no significant difference between their means, were found.

• Tutors' scores presented significant inter-group variability

• In contrast, G-Rubric's marks did not differ significantly between these same tutorial groups, indicating a higher level of homogeneity

Regarding how G-Rubric could support grading

Our proposal:

The students' essays will be graded first using G-Rubric; afterwards, tutors will grade them again to validate or modify the grades given.

Download page

http://www.elsemantico.es/gallito20/download-eng.html


Thank you for your attention