
Page 1:

Good Question! Statistical Ranking for Question Generation

Michael Heilman and Noah A. Smith
The North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2010)

Page 2:

Agenda

• Introduction
• Related Work
• Three-stage AQG Framework
• Evaluation
• Conclusion
• Comments

Page 3:

Introduction(1/3)

• In this paper, we focus on question generation (QG) for the creation of educational materials for reading practice and assessment.

• Our goal is to generate fact-based questions about the content of a given article.

• The top-ranked questions could be filtered and revised by educators, or given directly to students for practice.

• Here we restrict our investigation to questions about factual information in texts.

Page 4:

Introduction(2/3)

• Consider the following sentence from the Wikipedia article on the history of Los Angeles:

During the Gold Rush years in northern California, Los Angeles became known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north.

Generated question: What did Los Angeles become known as the “Queen of the Cow Counties” for?

Answer (from the source sentence): Los Angeles became known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north.

Page 5:

Introduction(3/3)

• Question transformation involves complex long distance dependencies.

• The characteristics of such phenomena are difficult to learn from corpora, but they have been studied extensively in linguistics.

• However, since many phenomena pertaining to question generation are not so easily encoded with rules, we include statistical ranking as an integral component.

• Thus, we employ an overgenerate-and-rank approach.

Page 6:

Related Work

• No previous QG system has involved statistical models for choosing among output candidates.

• Mitkov et al. (2006) demonstrated that automatic generation and manual correction of questions can be more time-efficient than manual authoring alone.

• Existing QG systems model their transformations from source text to questions with many complex rules for specific question types (e.g., a rule for creating a question Who did the Subject Verb? from a sentence with SVO word order and an object referring to a person), rather than with sets of general rules.

Page 7:

Research Objectives

• We apply statistical ranking to the task of generating natural language questions.

• We model QG as a two-step process of first simplifying declarative input sentences and then transforming them into questions.

• We incorporate linguistic knowledge to explicitly model well-studied phenomena related to long distance dependencies in WH questions.

• We develop a QG evaluation methodology, including the use of broad-domain corpora.

Page 8:

Three-stage AQG Framework

• We define a framework for generating a ranked set of fact-based questions about the text of a given article.

• From this set, the top-ranked questions might be given to an educator for filtering and revision, or perhaps directly to a student for practice.

Page 9:

Stage 1 Transforming Source Sentence

• Each of the sentences from the source text is expanded into a set of derived declarative sentences by altering lexical items, syntactic structure, and semantics.

• In our implementation, a set of transformations derives a simpler form of the source sentence by removing phrase types such as leading conjunctions, sentence-level modifying phrases, and appositives.

Page 10:

Stage 1 Transforming Source Sentence

• Complex source sentence:

Prime Minister Vladimir V. Putin, the country's paramount leader, cut short a trip to Siberia, returning to Moscow to oversee the federal response.

• Extracted factual sentences:

– Prime Minister Vladimir V. Putin cut short a trip to Siberia.
– Prime Minister Vladimir V. Putin was the country's paramount leader.
– Prime Minister Vladimir V. Putin returned to Moscow to oversee the federal response.
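The kind of extraction shown above can be made concrete with a small heuristic. The sketch below is a toy under stated assumptions (plain string matching rather than the parse-tree transformations the paper uses): it peels off an appositive between commas and turns it into a separate copula sentence.

```python
import re

def split_appositive(sentence):
    """Toy illustration of one Stage-1 transformation: if a sentence
    matches 'SUBJECT, APPOSITIVE, REST', emit (a) a simplified sentence
    without the appositive and (b) a copula sentence built from it.
    The past-tense copula 'was' is a simplifying assumption."""
    m = re.match(r"^([^,]+), ([^,]+), (.+)$", sentence)
    if not m:
        return [sentence]
    subject, appositive, rest = m.groups()
    return [f"{subject} {rest}",
            f"{subject} was {appositive}."]

src = ("Prime Minister Vladimir V. Putin, the country's paramount leader, "
       "cut short a trip to Siberia, returning to Moscow to oversee the "
       "federal response.")
for s in split_appositive(src):
    print(s)
```

Note that this toy does not split off the participial clause ("returning to Moscow..."); the full system applies several such transformations over parse trees.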

Page 11:

Stage 2 Question Transducer

• The declarative sentences derived in step 1 are transformed into sets of questions by a sequence of well-defined syntactic and lexical transformations (subject-auxiliary inversion, WH-movement, etc.).

• It identifies the answer phrases which may be targets for WH-movement and converts them into question phrases.

[Pipeline: Declarative Sentence → Mark Unmovable Phrases → Generate Possible Question Phrases* → (Decompose Main Verb) → (Invert Subject and Auxiliary) → Insert Question Phrase → Perform Post-processing → Question]

Page 12:

Stage 2 Question Transducer

• In English, various constraints determine whether phrases can be involved in WH-movement and other phenomena involving long distance dependencies.

• For example, noun phrases are “islands” to movement, meaning that constituents dominated by a noun phrase typically cannot undergo WH-movement.


John liked the book that I gave him.

What did John like?
*Who did John like the book that gave him?
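As a rough illustration of how unmovable phrases might be marked, the sketch below walks an NLTK constituency tree and flags every constituent dominated by an NP. This is a simplified stand-in for the island constraints the system encodes, and the hand-written parse is an assumption for illustration only.

```python
from nltk import Tree

def unmovable_positions(tree, dominated_by_np=False, pos=()):
    """Return tree positions of constituents dominated by an NP
    (toy version of the 'noun phrases are islands' constraint)."""
    marked = []
    child_dominated = dominated_by_np or tree.label() == "NP"
    for i, child in enumerate(tree):
        if not isinstance(child, Tree):
            continue
        child_pos = pos + (i,)
        if child_dominated:
            marked.append(child_pos)
        marked.extend(unmovable_positions(child, child_dominated, child_pos))
    return marked

# Hand-written parse of the example sentence (illustrative assumption).
t = Tree.fromstring(
    "(S (NP (NNP John)) (VP (VBD liked) (NP (NP (DT the) (NN book)) "
    "(SBAR (WHNP (WDT that)) (S (NP (PRP I)) (VP (VBD gave) (NP (PRP him))))))) (. .))")

for position in unmovable_positions(t):
    node = t[position]
    if node.label() == "NP":
        print("unmovable NP:", " ".join(node.leaves()))
# The NP 'I' inside the relative clause is marked, so it cannot become a
# question phrase -- ruling out *'Who did John like the book that gave him?'
```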

Page 13:

Stage 2 Question Transducer

• After marking unmovable phrases, we iteratively remove each possible answer phrase.

• The question phrases for a given answer phrase consist of a question word (e.g., who, what, where, when), possibly preceded by a preposition and, in the case of question phrases like whose car, followed by the head of the answer phrase.


Page 14:

Stage 2 Question Transducer

• The system annotates the source sentence with a set of entity types taken from the BBN IdentiFinder Text Suite and uses them to generate a final question.

• The set of labels from BBN includes those used in standard named entity recognition tasks (e.g., “PERSON,” “ORGANIZATION”) and their corresponding types for common nouns (e.g., “PER DESC,” “ORG DESC”).

Page 15:

Stage 2 Question Transducer

• It also includes dates, times, monetary units, and others.

• For a given answer phrase, the system uses the phrase’s entity labels and syntactic structure to generate a set of zero or more possible question phrases, each of which is used to generate a final question sentence.
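To make the label-to-question-word step concrete, here is a small illustrative mapping from entity labels on an answer phrase to candidate question words. The label set and choices below are assumptions for illustration, not the paper's full rules.

```python
# Illustrative (incomplete) mapping from BBN-style entity labels to WH words.
LABEL_TO_WH = {
    "PERSON": ["who"], "PER DESC": ["who"],
    "ORGANIZATION": ["what"], "ORG DESC": ["what"],
    "GPE": ["where", "what"], "LOCATION": ["where"],
    "DATE": ["when"], "TIME": ["when"],
    "MONEY": ["how much"],
}

def question_words(answer_labels):
    """Return zero or more candidate question words for an answer phrase,
    given the entity labels assigned to it."""
    words = []
    for label in answer_labels:
        for wh in LABEL_TO_WH.get(label, []):
            if wh not in words:
                words.append(wh)
    return words

print(question_words(["PERSON"]))           # ['who']
print(question_words(["GPE", "LOCATION"]))  # ['where', 'what']
```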


Page 16:

Stage 2 Question Transducer

• In order to perform subject-auxiliary inversion:
– If an auxiliary verb or modal is not present, the question transducer decomposes the main verb into the appropriate form of do and the base form of the main verb.
– If an auxiliary verb is already present, this decomposition is not necessary.


John saw Mary. → John did see Mary. → Who did John see?

John has seen Mary. → Who has John seen?
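A minimal sketch of the do-support step illustrated above, assuming a simple subject-verb-object clause in the past tense with no auxiliary. The tiny past-to-base verb table is a stand-in for the lemma map described on the next slide; this is an illustration, not the authors' code.

```python
# Toy do-support for subject-auxiliary inversion when no auxiliary is present.
# The verb table and the fixed clause shape are illustrative assumptions.
PAST_TO_BASE = {"saw": "see", "gave": "give", "liked": "like"}

def object_wh_question(subject, past_verb, wh="who"):
    """'John saw Mary.' -> 'Who did John see?'
    The past-tense main verb is decomposed into 'did' + its base form,
    and the WH word replacing the object is fronted."""
    base = PAST_TO_BASE[past_verb]   # a real system would use a Treebank/WordNet map
    return f"{wh.capitalize()} did {subject} {base}?"

print(object_wh_question("John", "saw"))               # Who did John see?
print(object_wh_question("John", "liked", wh="what"))  # What did John like?
```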

Page 17:

Stage 2 Question Transducer

• In order to convert between lemmas of verbs and the different surface forms that correspond to different parts of speech, we created a map from pairs of verb lemma and part of speech to verb surface forms.

• We extracted all verbs and their parts of speech from the Penn Treebank.

• We lemmatized each verb first by checking morphological variants in WordNet, and if a lemma was not found, then trimming the rightmost characters from the verb one at a time until a matching entry in WordNet was found.
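A rough reconstruction of the verb-form map just described, using NLTK's Penn Treebank sample and WordNet (requires the NLTK "treebank" and "wordnet" data). The exact procedure here is an approximation of the paper's description, not the authors' code.

```python
from nltk.corpus import treebank, wordnet as wn

def lemmatize_verb(verb):
    """WordNet lookup first; if no lemma is found, trim the rightmost
    characters one at a time until WordNet recognizes the remainder."""
    lemma = wn.morphy(verb.lower(), wn.VERB)
    if lemma:
        return lemma
    stem = verb.lower()
    while len(stem) > 1:
        stem = stem[:-1]
        lemma = wn.morphy(stem, wn.VERB)
        if lemma:
            return lemma
    return verb.lower()

# Map (lemma, Penn POS tag) -> observed surface form, e.g. ('see', 'VBD') -> 'saw'.
surface_form = {}
for word, tag in treebank.tagged_words():
    if tag.startswith("VB"):
        surface_form[(lemmatize_verb(word), tag)] = word.lower()

print(surface_form.get(("see", "VBD")))   # 'saw', if present in the corpus sample
```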


Page 18:


Stage 2 Question Transducer

• The transducer performs subject-auxiliary inversion either when the question to be generated is a yes-no question or when the answer phrase is a non-subject noun phrase.

• Each possible question phrase is inserted into a copy of the tree to produce a question.


Page 19:

Stage 2 Question Transducer

• Sentence-final periods are changed to question marks.

• An analysis of the system's output showed that nearly all of the questions containing pronouns were too vague (e.g., What does it have as a head of state?).

• Therefore, we filter out all questions containing personal pronouns, possessive pronouns, and noun phrases consisting solely of determiners (e.g., those).
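A sketch of that filter is shown below. The flat word lists and whitespace tokens are simplifying assumptions; the real system inspects noun phrases in the parse tree.

```python
# Discard questions whose noun phrases are likely to be vague:
# personal pronouns, possessive pronouns, or bare determiners.
PERSONAL = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}
POSSESSIVE = {"my", "your", "his", "her", "its", "our", "their"}
BARE_DETERMINERS = {"this", "that", "these", "those"}

def is_too_vague(question_tokens):
    tokens = {t.lower() for t in question_tokens}
    return bool(tokens & (PERSONAL | POSSESSIVE | BARE_DETERMINERS))

print(is_too_vague("What does it have as a head of state ?".split()))  # True
print(is_too_vague("Who returned to Moscow ?".split()))                # False
```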


Page 20:

Stage 3 Question Ranker

• Different source sentences, and different transformations of those sentences, may be more or less likely to lead to high-quality questions, so the candidate questions are ranked.

• Fifteen native English-speaking university students rated a set of questions produced from stages 1 and 2.

• For a predefined training set, each question was rated by a single annotator (not the same for each question), leading to a large number of diverse examples.

Page 21:

Stage 3 Question Ranker

• For the test set, each question was rated by three people (again, not the same for each question) to provide a more reliable gold standard.

• An inter-rater agreement of Fleiss's κ = 0.42 was computed from the test set's acceptability ratings (a sketch of this computation follows the table below).

Source                Training set (questions / texts)    Test set (questions / texts)
English Wikipedia     1328 / 12                           120 / 2
Simple English Wiki   1195 / 16                           118 / 2
Wall Street Journal    284 / 8                            190 / 2
Total                 2807 / 36                           428 / 6
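For reference, an agreement figure like Fleiss's κ can be computed from the per-question rating counts with statsmodels. The counts below are made-up toy data, not the paper's ratings.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per test question: [raters saying unacceptable, raters saying acceptable].
# Every question was rated by the same number of raters (three, as above).
ratings = np.array([
    [0, 3],
    [1, 2],
    [3, 0],
    [2, 1],
    [0, 3],
])
print(round(fleiss_kappa(ratings), 2))   # chance-corrected agreement in [-1, 1]
```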

Page 22:

Ranking

• Why do we overgenerate and rank questions?
– Named entity recognition errors
– Parsing errors
– Transformation errors

• Therefore, we use a discriminative ranker, specifically a logistic regression model that defines a probability of acceptability for each question.

M. Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.
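A minimal sketch of such a ranker using scikit-learn's logistic regression. The feature vectors here are made up (the real features are listed on the next slides), and this is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature vector per rated question
# (e.g., [question length, negation flag, LM log-likelihood]),
# with 1 = acceptable and 0 = unacceptable.
X_train = np.array([[8, 0, -21.3], [23, 1, -60.1], [11, 0, -30.5], [30, 1, -82.0]])
y_train = np.array([1, 0, 1, 0])
ranker = LogisticRegression().fit(X_train, y_train)

# Rank new candidate questions by P(acceptable | features).
candidates = ["Who did John see?", "What did it become known as for its role?"]
X_new = np.array([[5, 0, -14.2], [10, 0, -45.0]])
scores = ranker.predict_proba(X_new)[:, 1]
for question, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {question}")
```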

Page 23:

Feature Set

Feature types (with value types):

• Length (integer): the numbers of tokens in the question, the source sentence, and the answer phrase from which the WH phrase was generated

• Negation (boolean): the presence of not, never, or no in the question

• N-Gram Language Model (real-valued): the log likelihoods and length-normalized log likelihoods of the question, the source sentence, and the answer phrase

Page 24:

• Grammatical (integer): the numbers of proper nouns, pronouns, adjectives, adverbs, conjunctions, numbers, noun phrases, prepositional phrases, and subordinate clauses in the phrase structure parse trees for the question and the answer phrase

• Transformations (binary): the possible syntactic transformations (e.g., removal of appositives and parentheticals, choosing the subject of the source sentence as the answer phrase)

• Vagueness (integer): the numbers of noun phrases in the question, source sentence, and answer phrase that are potentially vague
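A toy extractor for a few of the feature types listed above. The whitespace tokenization and the small pronoun list are simplifying assumptions; the paper's grammatical counts come from parse trees.

```python
NEGATION = {"not", "never", "no"}
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}

def features(question, source_sentence, answer_phrase):
    q = question.split()
    s = source_sentence.split()
    a = answer_phrase.split()
    return {
        "len_question": len(q),                                   # Length features
        "len_source": len(s),
        "len_answer": len(a),
        "negation": int(any(t.lower() in NEGATION for t in q)),   # Negation feature
        "pronouns_in_question": sum(t.lower() in PRONOUNS for t in q),  # Vagueness proxy
    }

print(features("Who did John see ?", "John saw Mary .", "Mary"))
```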

Page 25:

Evaluation

• We present the results of experiments evaluating the quality of generated questions before and after ranking.

• The evaluation metric we employ is the percentage of test set questions labeled as acceptable.

• For rankings, our metric is the percentage of the top N% labeled as acceptable, for various N.
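Concretely, the ranking metric can be computed as below (the labels must already be sorted by the ranker's score, best first; the example labels are hypothetical).

```python
def acceptability_at_top(ranked_labels, percent):
    """Percentage of the top `percent`% of ranked questions labeled acceptable."""
    k = max(1, int(len(ranked_labels) * percent / 100))
    top = ranked_labels[:k]
    return 100.0 * sum(top) / len(top)

labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # 1 = acceptable, already ranked
print(acceptability_at_top(labels, 20))    # 100.0
print(acceptability_at_top(labels, 100))   # 40.0
```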

Page 26:

Results for Unranked Questions

• 27.3% of test set questions were labeled acceptable (i.e., having no deficiencies) by a majority of raters.

Page 27:

Page 28:

Results for Ranking

Page 29:

Ablation Result

Page 30:

Recall