Applications of Natural Language Processing, Course 7 – 05 April 2012, Diana Trandabăț (dtrandabat@info.uaic.ro)


Page 1:

Applications of Natural Language Processing
Course 7 – 05 April 2012
Diana Trandabăț, dtrandabat@info.uaic.ro

Page 2: Content

NLP in eLearning
◦ Generating test questions
◦ Keyword identification
◦ Extraction of definitions

Page 3: eLearning

eLearning comprises all forms of electronically supported learning and teaching.

eLearning 2.0 emerged with Web 2.0. Conventional e-learning systems were based on instructional packets delivered to students as assignments, which were then evaluated by the teacher.

In contrast, the new e-learning places increased emphasis on social learning and the use of social software such as blogs, wikis and podcasts.

Page 4: NLP in eLearning

NLP techniques in educational applications working with textual data:
◦ intelligent tutoring systems
◦ automatic generation of exercises
◦ assessment of learner-generated discourse
◦ reading and writing assistance

These applications require an adaptation of NLP techniques to various types of discourse, e.g. tutoring dialogues, which differ from typical task-oriented spoken dialogue systems.

Moreover, educational applications place strong requirements on NLP systems, which have to be robust yet accurate.

Page 5: Educational Natural Language Processing

[Diagram: Educational NLP sits at the intersection of eLearning (computer-assisted learning/instruction) and NLP (analysis and use of language by machines).]

Page 6: Educational NLP

Definition:
◦ Field of research exploring the use of NLP techniques in educational contexts

Why?
◦ Large text repositories with user-generated discourse and user-generated metadata are created
◦ These repositories need advanced information management and NLP to be accessed efficiently
◦ Using these repositories to create structured knowledge bases can improve NLP

Page 7: Computer-based Testing

Definition: all forms of assessment delivered with the help of computers, also called Computer Assisted/Aided Assessment (CAA)

Adequate question types for CAA (McKenna & Bull, 1999):
◦ Multiple-choice questions (MCQs)
◦ True/False questions
◦ Matching questions
◦ Ranking questions
◦ Sequencing questions
◦ etc.

Page 8: NLP for Computer Assisted Assessment

Generation of questions and exercises
◦ Writing test questions, especially objective test items, is an extremely difficult and time-consuming task for teachers
◦ NLP can be used to automatically generate objective test items, especially for language learning

Assessment and evaluation of answers to subjective test items
◦ NLP can be used to automatically:
   - diagnose errors in short-answer essays
   - grade essays

Page 9: Automatic Generation of Test Items

Source data
◦ Corpora: texts should be chosen according to
   - the learner model (level, mastered vocabulary)
   - the instructor model (target language, word category)
◦ Lexical semantic resources, e.g. WordNet

Tools (a pipeline sketch follows below)
◦ Tokeniser and sentence splitter
◦ Lemmatiser
◦ Conjugation and declension tools
◦ POS tagger
◦ Parser and chunker
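A minimal sketch of such a preprocessing pipeline in Python, assuming NLTK with its sentence splitter, POS tagger and WordNet data installed; this is an illustration, not the course's actual toolchain.

    # Preprocessing sketch: sentence splitting, tokenisation, POS tagging and lemmatisation.
    import nltk
    from nltk.stem import WordNetLemmatizer

    text = "Transitive verbs require objects. The verb is the central element of a clause."

    lemmatizer = WordNetLemmatizer()
    for sentence in nltk.sent_tokenize(text):       # sentence splitter
        tokens = nltk.word_tokenize(sentence)       # tokeniser
        tagged = nltk.pos_tag(tokens)               # POS tagger
        lemmas = [lemmatizer.lemmatize(tok.lower()) for tok, tag in tagged]  # lemmatiser
        print(tagged, lemmas)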

Page 10: Multiple-Choice Questions

Choose the correct answer among a set of possible answers:
◦ Who was voted the best international footballer for 2004?   (question focus)
   (a) Henry   (b) Beckham   (c) Ronaldinho   (d) Ronaldo   (distractors + correct answer / key)

Usually 3 to 5 alternative answers.

Page 11: Distractors

Distractors (also distracters) are the incorrect answers presented as choices in a multiple-choice test
◦ Challenge: generation of "good" distractors
   - Ensure that there is only one correct response for single-response MCQs
   - The key should not always occur at the same position in the list of answers
   - Distractors should be grammatically parallel with each other and approximately equal in length
   - Distractors should be plausible and attractive; however, they should not be so close to the correct answer that they risk confusing students

Page 12: Multiple-Choice Questions

1. Selection of the key
   - Unknown words that appear in a reading
   - Domain-specific terms

2. Generation of the question focus
   - Constrained patterns
   - Transformation of source clauses into question focuses (see the sketch below)
     "Transitive verbs require objects." → "Which kind of verbs require objects?"
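A toy illustration of the clause-to-question transformation above, assuming the key is a pre-modifier of the subject noun; real systems use constrained syntactic patterns rather than simple string substitution.

    # Replace the key term in the source clause with a wh-phrase to form the question focus.
    import re

    def clause_to_question(clause, key):
        # "Transitive verbs require objects." + key "Transitive"
        # -> "Which kind of verbs require objects?"
        stem = re.sub(r'\b' + re.escape(key) + r'\b\s*', 'Which kind of ', clause, count=1)
        return stem.rstrip(' .') + '?'

    print(clause_to_question("Transitive verbs require objects.", "Transitive"))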

Page 13: Multiple-Choice Questions

3. Generation of the distractors
   - WordNet concepts which are semantically close to the key, e.g. hypernyms and co-hyponyms (a sketch follows below)
     ◦ "Which part of speech serves as the most central element in a clause?"
     ◦ Key: "verb"
     ◦ Distractors: "noun", "adjective", "preposition"
   - Same POS
   - Similar frequency range
   - For grammar questions, use a declension or conjugation tool to generate different forms of the key, e.g. change case, number, person, mood, tense, etc.
   - Common student errors in the given context
   - Collocations: frequent co-occurrence with either the left or the right context
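A sketch of the WordNet-based strategy, assuming NLTK's WordNet data is installed; sense selection is deliberately naive (first noun synset), so the candidates still need the POS and frequency filters listed above.

    # Generate distractor candidates as co-hyponyms (sister terms) of the key in WordNet.
    from nltk.corpus import wordnet as wn

    def cohyponym_distractors(key, n=3):
        synsets = wn.synsets(key, pos=wn.NOUN)
        if not synsets:
            return []
        distractors = set()
        for hypernym in synsets[0].hypernyms():
            for sister in hypernym.hyponyms():        # co-hyponyms of the key
                for lemma in sister.lemmas():
                    word = lemma.name().replace("_", " ")
                    if word.lower() != key.lower():
                        distractors.add(word)
        return sorted(distractors)[:n]

    print(cohyponym_distractors("verb"))   # candidate sister terms of "verb" in WordNet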

Page 14: Fill-in-the-Blank Questions

Consists of a portion of text with certain words removed

The student is asked to "fill in the blanks"

Challenges:
◦ Phrase the question so that only one correct answer is possible (e.g. verb to be conjugated)

Page 15: Fill-in-the-Blank Question Generation

1. Selection of an input corpus
2. POS tagging
3. Selection of the blanks in the input corpus (see the sketch after this list)
   ◦ Every n-th (e.g. fifth or eighth) word in the text
   ◦ Words in specified frequency ranges, e.g. only high-frequency or low-frequency words
   ◦ Words belonging to a given grammatical category
   ◦ Open-class words, given their POS
   ◦ Machine learning, based on a pool of input questions used as training data
4. Where needed, provide some information about the word in the blank, e.g. the verb lemma when the test targets verb conjugation
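A sketch of two of the blank-selection strategies, assuming NLTK's tokeniser and tagger data are available; it does not check that only one answer fits each blank.

    # (a) blank every n-th word, (b) blank words of a given POS (verbs here, via NLTK's tagger).
    import nltk

    def blank_every_nth(tokens, n=5):
        return ["____" if i % n == n - 1 else tok for i, tok in enumerate(tokens)]

    def blank_by_pos(tokens, pos_prefix="VB"):
        # blank every verb; step 4 above would additionally show the verb lemma as a hint
        tagged = nltk.pos_tag(tokens)
        return ["____" if tag.startswith(pos_prefix) else tok for tok, tag in tagged]

    tokens = nltk.word_tokenize("Conventional e-learning systems were based on instructional packets.")
    print(" ".join(blank_every_nth(tokens)))
    print(" ".join(blank_by_pos(tokens)))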

Page 16: Overview of the assessment of learner-generated data

Short-answer assessment
◦ Input: the learner's response, one or more target responses, the question, the source reading passage
◦ Linguistic analysis: annotation, alignment, diagnosis

Essays
Plagiarism detection
Speech generation

Page 17: Automatic Text Simplification

Related techniques: summarisation and sentence compression

Syntactic simplification:
◦ Removal or replacement of difficult syntactic structures, using hand-built transformational rules applied to dependency and parse trees

Lexical simplification (a sketch follows below):
◦ Replace difficult words with simpler ones
◦ Difficult words are identified using the number of syllables and/or frequency counts in a corpus
◦ Choose the simplest WordNet synonym for each difficult word
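A minimal sketch of the lexical-simplification step, assuming NLTK's WordNet data; the detection of difficult words via syllable counts or corpus frequencies is omitted.

    # Propose the shortest WordNet synonym as the "simplest" substitute for a word.
    from nltk.corpus import wordnet as wn

    def simpler_synonym(word):
        candidates = {lemma.name().replace("_", " ")
                      for synset in wn.synsets(word)
                      for lemma in synset.lemmas()}
        candidates.discard(word)
        # crude proxy for "simplest": the shortest synonym; a real system would use frequency
        return min(candidates, key=len) if candidates else word

    print(simpler_synonym("utilise"))   # output depends on WordNet coverage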

Page 18: Vocabulary Assistance for Reading

Overall goal: support vocabulary acquisition during reading for:
◦ children, who learn to read
◦ foreign language learners, who read texts in a foreign language

Problem: a word's context may not provide enough information about its meaning

Solution: augment documents with dynamically generated annotations about (problematic) words

Page 19: Automatic detection of definitions

A grammar is created for the automatic identification of definitions in texts.

Types of definitions (toy regex versions follow below):
"is_def" – "HTML este tot un protocol folosit de World Wide Web." (HTML is also a protocol used by the World Wide Web.)
"verb_def" – "Poşta electronică reprezintă transmisia mesajelor prin intermediul unor reţele electronice." (Electronic mail represents the transmission of messages through electronic networks.)
"punct_def" – "Bit – prescurtarea pentru binary digit" (Bit – short for binary digit)
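Toy regular-expression versions of three of these patterns; the actual system described on the following pages uses lxtransduce grammar rules over POS-tagged text rather than plain regexes over strings.

    # Match "X este/is Y", "X reprezintă/represents Y" and "X – Y" style definitions.
    import re

    PATTERNS = {
        "is_def":    re.compile(r"^(?P<term>[^,]+?)\s+(este|is)\s+(?P<definition>.+)$"),
        "verb_def":  re.compile(r"^(?P<term>[^,]+?)\s+(reprezintă|represents)\s+(?P<definition>.+)$"),
        "punct_def": re.compile(r"^(?P<term>[\w\s]+?)\s+[–-]\s+(?P<definition>.+)$"),
    }

    sentence = "Bit – prescurtarea pentru binary digit"
    for name, pattern in PATTERNS.items():
        match = pattern.match(sentence)
        if match:
            print(name, "| term:", match.group("term"), "| definition:", match.group("definition"))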

Page 20: Types of definitions (continued)

"layout_def" – the definition follows from the layout, e.g. a heading followed by a defining sentence:
   Ro: Organizarea datelor
       Cel mai simplu mod de organizare este cel secvenţial.
   En: Data organizing
       The simplest method is the sequential one.

"pron_def" – "…definirii conceptului de baze de date. Acesta descrie metode de…" (…defining the database concept. It describes methods of…)

"other_def" – "triunghi echilateral, adică cu toate laturile egale" (equilateral triangle, i.e. having all sides equal)

Page 21: Distribution of the definitions

Type         Manual   %      Automatic   %
is_def          70    33.8      204      32.8
verb_def       116    56.0      272      43.8
punct_def       15     7.2      124      20.0
layout_def       2     1.0       21       3.4
pron_def         4     2.0        0       0.0
Total          207               621

Page 22: Rules

Simple grammar rules
Composed grammar rules

"is_def" grammar rule:

    <rule name="may_be_term">
      <seq>
        <query match="tok[@base='fi' and substring(@ctag,1,5)='vmip3']"/>
        <first>
          <ref name="UndefNominal"/>
          <ref name="DefNominal"/>
        </first>
      </seq>
    </rule>

The rule matches a token whose lemma is "fi" (to be) with a main-verb indicative present 3rd-person tag, followed by either an undefined or a defined nominal group (the UndefNominal / DefNominal rules).

Page 23: Evaluation

Definition type   Sentence-level matching (P / R / F2)   Token-level matching (P / R / F2)
is_def            0.5366 / 1.0 / 0.7765                  0.0648 / 0.3328 / 0.14
verb_def          0.7561 / 1.0 / 0.9029                  0.0471 / 0.1422 / 0.085
punct_def         0.1463 / 1.0 / 0.3396                  0.0025 / 0.1163 / 0.0072
layout_def        0.0488 / 1.0 / 0.1333                  0.0007 / 0.1020 / 0.0022

Lxtransduce (Tobin 2005) is used to match the grammar against the files.

Page 24: Question Answering

According to the answer type, we have the following types of questions (Harabagiu, Moldovan 2007):

◦ Factoid – "Who discovered oxygen?", "When did Hawaii become a state?" or "What football team won the World Cup in 1992?"

◦ List – "What countries export oil?" or "What are the regions preferred by Americans for holidays?"

◦ Definition – "What is a quasar?" or "What is a question-answering system?"

Page 25: QA – Example

Question: Cine este Zeus? (Who is Zeus?) → (Cine, zeus, PERSON)

Snippet: 0026#10014#1.0#Zeus#Zeus\zeus\NP este\fi\V3\ cel\cel\TSR\ mai\mai\R\ puternic\puternic\ASN\ dintre\dintre\S\ olimpieni\olimpieni\NPN\ ,\,\COMMA\ socotit\socoti\VP\ drept\drept\S\ stăpânul\stăpân\NSRY\ suprem\suprem\ASN\ al\al\TS\ oamenilor\om\NPOY\ şi\şi\CR\ al\al\TS\ zeilor\zeu\NPOY\ .\.\PERIOD\

Our "is_def" pattern (\zeus\.*\NP .*\fi\V3\ (.*)) matches the snippet (a Python sketch follows below).
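A rough Python rendering of how such a pattern picks the answer out of the annotated snippet (the snippet is truncated here); the original pattern syntax is kept only approximately.

    # Find "zeus" tagged NP, then a 3rd-person form of "fi" (to be), capture what follows.
    import re

    snippet = (r"Zeus\zeus\NP este\fi\V3\ cel\cel\TSR\ mai\mai\R\ puternic\puternic\ASN\ "
               r"dintre\dintre\S\ olimpieni\olimpieni\NPN\ ...")

    pattern = re.compile(r"\\zeus\\NP .*?\\fi\\V3\\ (.*)")
    match = pattern.search(snippet)
    if match:
        print("Answer candidate:", match.group(1))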

Page 26: Keyword extraction

Using a training corpus of documents annotated with keywords

Measuring the distribution of manually marked keywords over documents

Page 27:

Language     # of annotated documents   Average length (# of tokens)
Bulgarian             55                        3980
Czech                465                         672
Dutch                 72                        6912
English               36                        9707
German                34                        8201
Polish                25                        4432
Portuguese            29                        8438
Romanian              41                        3375

Page 28:

Language     # of keywords   Average # of keywords per doc.
Bulgarian        3236                 77
Czech            1640                  3.5
Dutch            1706                 24
English          1174                 26
German           1344                 39.5
Polish           1033                 41
Portuguese        997                 34
Romanian         2555                 62

Page 29: Reflection

Did the human annotators annotate keywords or domain terms?

Was the task adequately contextualised?

Page 30: Keyword extraction

Good keywords have a typical, non-random distribution in and across documents

Keywords tend to appear more often at certain places in texts (headings etc.)

Keywords are often highlighted / emphasised by authors

Keywords express / represent the topic(s) of a text

Page 31: Modelling Keywordiness

Linguistic filtering of keyword candidates, based on part of speech and morphology

Distributional measures are used to identify unevenly distributed words
◦ TF-IDF (a sketch follows below)

Knowledge of text structure is used to identify salient regions (e.g., headings)

Layout features of texts are used to identify emphasised words and weight them higher

Finding chains of semantically related words
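A minimal TF-IDF sketch over a toy three-document corpus; the corpus and tokenisation are placeholders, and the linguistic filtering and structural weights above are not applied.

    # Score each word of one document by term frequency times inverse document frequency.
    import math

    docs = [
        "learning styles and learning strategies in e-learning".split(),
        "database systems store and organise data".split(),
        "natural language processing for learning applications".split(),
    ]

    def tfidf(term, doc, docs):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in docs if term in d)
        return tf * math.log(len(docs) / df)

    doc = docs[0]
    scores = {term: tfidf(term, doc, docs) for term in set(doc)}
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{term}: {score:.3f}")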

Page 32: Challenges

Treating multi-word keywords

Assigning a combined weight which takes into account all the aforementioned factors

Multilinguality: finding good settings for all languages, balancing language-dependent and language-independent features

Page 33: Treatment of keyphrases

Keyphrases have to be restricted with respect to length (max 3 words) and frequency (min 2 occurrences)

Keyphrase patterns must be restricted with respect to linguistic categories ("style of learning" is acceptable; "of learning styles" is not); a filtering sketch follows below
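A sketch of these restrictions as an n-gram filter; the stop-word list is a toy placeholder for the linguistic category patterns.

    # Keep n-grams of at most 3 words, occurring at least twice, that neither start nor end
    # with a function word, so "style of learning" passes while "of learning styles" does not.
    from collections import Counter

    STOP = {"of", "the", "a", "an", "and", "for", "in", "to", "is"}

    def candidate_keyphrases(tokens, max_len=3, min_freq=2):
        ngrams = Counter()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if gram[0] not in STOP and gram[-1] not in STOP:
                    ngrams[gram] += 1
        return {" ".join(g): c for g, c in ngrams.items() if c >= min_freq}

    text = ("the style of learning matters because every style of learning "
            "suits a different learner").split()
    print(candidate_keyphrases(text))   # keeps "style of learning", drops "of learning ..."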

Page 34: KWE Evaluation

Human annotators marked n keywords in document d

The first n choices of the KWE system for document d are extracted

The overlap between both sets is measured; partial matches are also counted (a sketch follows below)
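A sketch of the overlap computation; "partial match" is interpreted here as one keyphrase containing the other, which is only one possible reading of the slide.

    # Count exact matches and containment-based partial matches between keyword sets.
    def keyword_overlap(human, system):
        human, system = set(human), set(system)
        exact = human & system
        partial = {(h, s) for h in human for s in system
                   if h != s and (h in s or s in h)}
        return len(exact), len(partial)

    human = ["learning style", "e-learning", "keyword extraction"]
    system = ["learning style", "keyword", "assessment"]
    print(keyword_overlap(human, system))   # (1, 1): one exact and one partial match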

Page 35: NLP has lots to offer

Resources:
◦ Lexical semantic resources, e.g. WordNet
◦ Web 2.0 resources, e.g. Wikipedia, Wiktionary

Tools:
◦ Tokenisation and sentence splitting
◦ Morphological analysis
◦ Part-of-speech tagging
◦ Parsing and chunking
◦ Word sense disambiguation
◦ Summarisation
◦ Keyword extraction

Page 36: Tasks and applications

To assist instructors
◦ Automatic generation of questions and exercises
◦ Assessment of learner-generated discourse

To assist learners
◦ Reading and writing assistance
◦ Electronic career guidance
◦ Educational question answering

For all users in the Web 2.0
◦ NLP for wikis
◦ Quality assessment of user-generated content

Page 37: A lot more research is done on:

Computer-Assisted Language Learning
Intelligent Tutoring Systems
Information search for eLearning
Educational blogging
Annotations and social tagging
Analysing collaborative learning processes automatically
Learner corpora and resources
eLearning standards, e.g. SCORM

Page 38: Requirements (team: max 2 persons, deadline: 12 April)

1a) Extract definitions from a given Wikipedia page
1b) Generate questions such as "What is …?" or "What is the meaning of …?" from the list above

2) Automatic generation of fill-in-the-blank questions
   Dacă nu ai nimic planificat diseară, hai __ teatru. (If you have nothing planned tonight, let's go __ the theatre.)
   (a) la   (b) de   (c) pentru   (d) null

◦ Input: a sentence and the key
   Dacă nu ai nimic planificat diseară, hai la teatru. Key: la
◦ Output: generate three distractors using different approaches:
   - baseline: word frequencies (a toy sketch follows below)
   - collocations
   - a "creative" method, devised by the students
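A sketch of the word-frequency baseline for task 2; the corpus and preposition list are invented placeholders, not the course data, and the collocation-based and "creative" methods are left to the students.

    # Propose the most frequent prepositions other than the key as distractors.
    from collections import Counter

    PREPOSITIONS = {"la", "de", "pentru", "cu", "pe", "din"}
    corpus = "mergem la teatru si vorbim de film pentru ca venim de la scoala cu autobuzul".split()

    def baseline_distractors(key, n=3):
        freq = Counter(tok for tok in corpus if tok in PREPOSITIONS and tok != key)
        return [word for word, _ in freq.most_common(n)]

    print(baseline_distractors("la"))   # e.g. ['de', 'pentru', 'cu'] on this toy corpus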

Page 39: Further reading

Jill Burstein: Opportunities for Natural Language Processing Research in Education. Proceedings of CICLing '09, the 10th International Conference on Computational Linguistics and Intelligent Text Processing, Springer-Verlag, Berlin/Heidelberg, 2009.

Paola Monachesi, Eline Westerhout: What can NLP techniques do for eLearning? Presented at INFOS 2008, 27-29 March 2008.

Adrian Iftene, Diana Trandabăţ, Ionuţ Pistol: Grammar-based Automatic Extraction of Definitions and Applications for Romanian. RANLP 2007 workshop on Natural Language Processing and Knowledge Representation for eLearning Environments, 2007.

Page 41:

Thanks!

Page 42: Types of plagiarism

(1) Plagiarism of authorship: the direct case of putting your own name to someone else's work.

(2) Word-for-word plagiarism: copying phrases or passages from published text without quotation or acknowledgement.

(3) Paraphrasing plagiarism: words or syntax are changed (rewritten), but the source text can still be recognised.

(4) Plagiarism of the form of a source: the structure of an argument in a source is copied (verbatim or rewritten).

(5) Plagiarism of ideas: the reuse of an original thought from a source text without dependence on the words or form of the source.

(6) Plagiarism of secondary sources: original sources are referenced or quoted, but obtained from a secondary source text without looking up the original.