
Page 1:

Good Question! Statistical Ranking for Question Generation

Michael Heilman and Noah A. Smith
The North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2010)

Page 2:

Agenda

• Introduction
• Related Work
• Three-stage AQG Framework
• Evaluation
• Conclusion
• Comments

Page 3:

Introduction(1/3)

• In this paper, we focus on question generation (QG) for the creation of educational materials for reading practice and assessment.

• Our goal is to generate fact-based questions about the content of a given article.

• The top-ranked questions could be filtered and revised by educators, or given directly to students for practice.

• Here we restrict our investigation to questions about factual information in texts.

Page 4:

Introduction(2/3)

• Consider the following sentence from the Wikipedia article on the history of Los Angeles:

During the Gold Rush years in northern California, Los Angeles became known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north.

Generated question: What did Los Angeles become known as the “Queen of the Cow Counties” for?

Answer (from the source sentence): Los Angeles became known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north.

Page 5:

Introduction(3/3)

• Question transformation involves complex long distance dependencies.

• The characteristics of such phenomena are difficult to learn from corpora, but they have been studied extensively in linguistics.

• However, since many phenomena pertaining to question generation are not so easily encoded with rules, we include statistical ranking as an integral component.

• Thus, we employ an overgenerate-and-rank approach.

Page 6:

Related Work

• No previous QG system has involved statistical models for choosing among output candidates.

• Mitkov et al. (2006) demonstrated that automatic generation and manual correction of questions can be more time-efficient than manual authoring alone.

• Existing QG systems model their transformations from source text to questions with many complex rules for specific question types (e.g., a rule for creating a question Who did the Subject Verb? from a sentence with SVO word order and an object referring to a person), rather than with sets of general rules.

Page 7:

Research Objectives

• We apply statistical ranking to the task of generating natural language questions.

• We model QG as a two-step process of first simplifying declarative input sentences and then transforming them into questions.

• We incorporate linguistic knowledge to explicitly model well-studied phenomena related to long distance dependencies in WH questions.

• We develop a QG evaluation methodology, including the use of broad-domain corpora.

Page 8:

Three-stage AQG Framework

• We define a framework for generating a ranked set of fact-based questions about the text of a given article.

• From this set, the top-ranked questions might be given to an educator for filtering and revision, or perhaps directly to a student for practice.

Page 9:

Stage 1 Transforming Source Sentence

• Each of the sentences from the source text is expanded into a set of derived declarative sentences by altering lexical items, syntactic structure, and semantics.

• In our implementation, a set of transformations derives a simpler form of the source sentence by removing phrase types such as leading conjunctions, sentence-level modifying phrases, and appositives.

Page 10:

Stage 1 Transforming Source Sentence

• Complex source sentence:

Prime Minister Vladimir V. Putin, the country's paramount leader, cut short a trip to Siberia, returning to Moscow to oversee the federal response.

• Extracted factual sentences:

– Prime Minister Vladimir V. Putin cut short a trip to Siberia.
– Prime Minister Vladimir V. Putin was the country's paramount leader.
– Prime Minister Vladimir V. Putin returned to Moscow to oversee the federal response.
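The kind of extraction shown above can be made concrete with a small heuristic. The sketch below is a toy under stated assumptions (plain string matching rather than the parse-tree transformations the paper uses): it peels off an appositive between commas and turns it into a separate copula sentence.

```python
import re

def split_appositive(sentence):
    """Toy illustration of one Stage-1 transformation: if a sentence
    matches 'SUBJECT, APPOSITIVE, REST', emit (a) a simplified sentence
    without the appositive and (b) a copula sentence built from it.
    The past-tense copula 'was' is a simplifying assumption."""
    m = re.match(r"^([^,]+), ([^,]+), (.+)$", sentence)
    if not m:
        return [sentence]
    subject, appositive, rest = m.groups()
    return [f"{subject} {rest}",
            f"{subject} was {appositive}."]

src = ("Prime Minister Vladimir V. Putin, the country's paramount leader, "
       "cut short a trip to Siberia, returning to Moscow to oversee the "
       "federal response.")
for s in split_appositive(src):
    print(s)
```

Note that this toy does not split off the participial clause ("returning to Moscow..."); the full system applies several such transformations over parse trees.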

Page 11:

Stage 2 Question Transducer

• The declarative sentences derived in step 1 are transformed into sets of questions by a sequence of well-defined syntactic and lexical transformations (subject-auxiliary inversion, WH-movement, etc.).

• It identifies the answer phrases which may be targets for WH-movement and converts them into question phrases.

[Pipeline: Declarative Sentence → Mark Unmovable Phrases → Generate Possible Question Phrases* → (Decompose Main Verb) → (Invert Subject and Auxiliary) → Insert Question Phrase → Perform Post-processing → Question]

Page 12:

Stage 2 Question Transducer

• In English, various constraints determine whether phrases can be involved in WH-movement and other phenomena involving long distance dependencies.

• For example, noun phrases are “islands” to movement, meaning that constituents dominated by a noun phrase typically cannot undergo WH-movement.


John liked the book that I gave him.

What did John like?
*Who did John like the book that gave him?
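As a rough illustration of how unmovable phrases might be marked, the sketch below walks an NLTK constituency tree and flags every constituent dominated by an NP. This is a simplified stand-in for the island constraints the system encodes, and the hand-written parse is an assumption for illustration only.

```python
from nltk import Tree

def unmovable_positions(tree, dominated_by_np=False, pos=()):
    """Return tree positions of constituents dominated by an NP
    (toy version of the 'noun phrases are islands' constraint)."""
    marked = []
    child_dominated = dominated_by_np or tree.label() == "NP"
    for i, child in enumerate(tree):
        if not isinstance(child, Tree):
            continue
        child_pos = pos + (i,)
        if child_dominated:
            marked.append(child_pos)
        marked.extend(unmovable_positions(child, child_dominated, child_pos))
    return marked

# Hand-written parse of the example sentence (illustrative assumption).
t = Tree.fromstring(
    "(S (NP (NNP John)) (VP (VBD liked) (NP (NP (DT the) (NN book)) "
    "(SBAR (WHNP (WDT that)) (S (NP (PRP I)) (VP (VBD gave) (NP (PRP him))))))) (. .))")

for position in unmovable_positions(t):
    node = t[position]
    if node.label() == "NP":
        print("unmovable NP:", " ".join(node.leaves()))
# The NP 'I' inside the relative clause is marked, so it cannot become a
# question phrase -- ruling out *'Who did John like the book that gave him?'
```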

Page 13:

Stage 2 Question Transducer

• After marking unmovable phrases, we iteratively remove each possible answer phrase.

• The question phrases for a given answer phrase consist of a question word (e.g., who, what, where, when), possibly preceded by a preposition and, in the case of question phrases like whose car, followed by the head of the answer phrase.


Page 14:

Stage 2 Question Transducer

• The system annotates the source sentence with a set of entity types taken from the BBN IdentiFinder Text Suite and uses them to generate a final question.

• The set of labels from BBN includes those used in standard named entity recognition tasks (e.g., “PERSON,” “ORGANIZATION”) and their corresponding types for common nouns (e.g., “PER DESC,” “ORG DESC”).

Page 15:

Stage 2 Question Transducer

• It also includes dates, times, monetary units, and others.

• For a given answer phrase, the system uses the phrase’s entity labels and syntactic structure to generate a set of zero or more possible question phrases, each of which is used to generate a final question sentence.
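To make the label-to-question-word step concrete, here is a small illustrative mapping from entity labels on an answer phrase to candidate question words. The label set and choices below are assumptions for illustration, not the paper's full rules.

```python
# Illustrative (incomplete) mapping from BBN-style entity labels to WH words.
LABEL_TO_WH = {
    "PERSON": ["who"], "PER DESC": ["who"],
    "ORGANIZATION": ["what"], "ORG DESC": ["what"],
    "GPE": ["where", "what"], "LOCATION": ["where"],
    "DATE": ["when"], "TIME": ["when"],
    "MONEY": ["how much"],
}

def question_words(answer_labels):
    """Return zero or more candidate question words for an answer phrase,
    given the entity labels assigned to it."""
    words = []
    for label in answer_labels:
        for wh in LABEL_TO_WH.get(label, []):
            if wh not in words:
                words.append(wh)
    return words

print(question_words(["PERSON"]))           # ['who']
print(question_words(["GPE", "LOCATION"]))  # ['where', 'what']
```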


Page 16:

Stage 2 Question Transducer

• In order to perform subject-auxiliary inversion:
– If an auxiliary verb or modal is not present, the question transducer decomposes the main verb into the appropriate form of do and the base form of the main verb.
– If an auxiliary verb is already present, this decomposition is not necessary.


John saw Mary. → John did see Mary. → Who did John see?

John has seen Mary. → Who has John seen?
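A minimal sketch of the do-support step illustrated above, assuming a simple subject-verb-object clause in the past tense with no auxiliary. The tiny past-to-base verb table is a stand-in for the lemma map described on the next slide; this is an illustration, not the authors' code.

```python
# Toy do-support for subject-auxiliary inversion when no auxiliary is present.
# The verb table and the fixed clause shape are illustrative assumptions.
PAST_TO_BASE = {"saw": "see", "gave": "give", "liked": "like"}

def object_wh_question(subject, past_verb, wh="who"):
    """'John saw Mary.' -> 'Who did John see?'
    The past-tense main verb is decomposed into 'did' + its base form,
    and the WH word replacing the object is fronted."""
    base = PAST_TO_BASE[past_verb]   # a real system would use a Treebank/WordNet map
    return f"{wh.capitalize()} did {subject} {base}?"

print(object_wh_question("John", "saw"))               # Who did John see?
print(object_wh_question("John", "liked", wh="what"))  # What did John like?
```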

Page 17:

Stage 2 Question Transducer

• In order to convert between lemmas of verbs and the different surface forms that correspond to different parts of speech, we created a map from pairs of verb lemma and part of speech to verb surface forms.

• We extracted all verbs and their parts of speech from the Penn Treebank.

• We lemmatized each verb first by checking morphological variants in WordNet, and if a lemma was not found, then trimming the rightmost characters from the verb one at a time until a matching entry in WordNet was found.
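A rough reconstruction of the verb-form map just described, using NLTK's Penn Treebank sample and WordNet (requires the NLTK "treebank" and "wordnet" data). The exact procedure here is an approximation of the paper's description, not the authors' code.

```python
from nltk.corpus import treebank, wordnet as wn

def lemmatize_verb(verb):
    """WordNet lookup first; if no lemma is found, trim the rightmost
    characters one at a time until WordNet recognizes the remainder."""
    lemma = wn.morphy(verb.lower(), wn.VERB)
    if lemma:
        return lemma
    stem = verb.lower()
    while len(stem) > 1:
        stem = stem[:-1]
        lemma = wn.morphy(stem, wn.VERB)
        if lemma:
            return lemma
    return verb.lower()

# Map (lemma, Penn POS tag) -> observed surface form, e.g. ('see', 'VBD') -> 'saw'.
surface_form = {}
for word, tag in treebank.tagged_words():
    if tag.startswith("VB"):
        surface_form[(lemmatize_verb(word), tag)] = word.lower()

print(surface_form.get(("see", "VBD")))   # 'saw', if present in the corpus sample
```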


Page 18:


Stage 2 Question Transducer

• The transducer performs subject-auxiliary inversion either when the question to be generated is a yes-no question or when the answer phrase is a non-subject noun phrase.

• Each possible question phrase is inserted into a copy of the tree to produce a question.


Page 19:

Stage 2 Question Transducer

• Sentence-final periods are changed to question marks.

• An analysis of the system's output showed that nearly all of the questions containing pronouns were too vague (e.g., What does it have as a head of state?).

• Therefore, we filter out all questions containing personal pronouns, possessive pronouns, and noun phrases consisting solely of determiners (e.g., those).
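A sketch of that filter is shown below. The flat word lists and whitespace tokens are simplifying assumptions; the real system inspects noun phrases in the parse tree.

```python
# Discard questions whose noun phrases are likely to be vague:
# personal pronouns, possessive pronouns, or bare determiners.
PERSONAL = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}
POSSESSIVE = {"my", "your", "his", "her", "its", "our", "their"}
BARE_DETERMINERS = {"this", "that", "these", "those"}

def is_too_vague(question_tokens):
    tokens = {t.lower() for t in question_tokens}
    return bool(tokens & (PERSONAL | POSSESSIVE | BARE_DETERMINERS))

print(is_too_vague("What does it have as a head of state ?".split()))  # True
print(is_too_vague("Who returned to Moscow ?".split()))                # False
```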


Page 20:

Stage 3 Question Ranker

• Different source sentences, and different transformations of those sentences, may be more or less likely to lead to high-quality questions, so the candidate questions are ranked.

• Fifteen native English-speaking university students rated a set of questions produced from stages 1 and 2.

• For a predefined training set, each question was rated by a single annotator (not the same for each question), leading to a large number of diverse examples.

Page 21:

Stage 3 Question Ranker

• For the test set, each question was rated by three people (again, not the same for each question) to provide a more reliable gold standard.

• An inter-rater agreement of Fleiss's κ = 0.42 was computed from the test set's acceptability ratings (a sketch of this computation follows the table below).

Source                Training set (questions / texts)    Test set (questions / texts)
English Wikipedia     1328 / 12                           120 / 2
Simple English Wiki   1195 / 16                           118 / 2
Wall Street Journal    284 / 8                            190 / 2
Total                 2807 / 36                           428 / 6
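For reference, an agreement figure like Fleiss's κ can be computed from the per-question rating counts with statsmodels. The counts below are made-up toy data, not the paper's ratings.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per test question: [raters saying unacceptable, raters saying acceptable].
# Every question was rated by the same number of raters (three, as above).
ratings = np.array([
    [0, 3],
    [1, 2],
    [3, 0],
    [2, 1],
    [0, 3],
])
print(round(fleiss_kappa(ratings), 2))   # chance-corrected agreement in [-1, 1]
```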

Page 22:

Ranking

• Why do we overgenerate and rank questions?
– Named entity recognition errors
– Parsing errors
– Transformation errors

• Therefore, we use a discriminative ranker, specifically a logistic regression model that defines a probability of acceptability for each question.

M. Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.
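A minimal sketch of such a ranker using scikit-learn's logistic regression. The feature vectors here are made up (the real features are listed on the next slides), and this is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature vector per rated question
# (e.g., [question length, negation flag, LM log-likelihood]),
# with 1 = acceptable and 0 = unacceptable.
X_train = np.array([[8, 0, -21.3], [23, 1, -60.1], [11, 0, -30.5], [30, 1, -82.0]])
y_train = np.array([1, 0, 1, 0])
ranker = LogisticRegression().fit(X_train, y_train)

# Rank new candidate questions by P(acceptable | features).
candidates = ["Who did John see?", "What did it become known as for its role?"]
X_new = np.array([[5, 0, -14.2], [10, 0, -45.0]])
scores = ranker.predict_proba(X_new)[:, 1]
for question, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {question}")
```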

Page 23:

Feature Set

Feature types (with value types):

• Length (integer): the numbers of tokens in the question, the source sentence, and the answer phrase from which the WH phrase was generated

• Negation (boolean): the presence of not, never, or no in the question

• N-Gram Language Model (real-valued): the log likelihoods and length-normalized log likelihoods of the question, the source sentence, and the answer phrase

Page 24:

• Grammatical (integer): the numbers of proper nouns, pronouns, adjectives, adverbs, conjunctions, numbers, noun phrases, prepositional phrases, and subordinate clauses in the phrase structure parse trees for the question and the answer phrase

• Transformations (binary): the possible syntactic transformations (e.g., removal of appositives and parentheticals, choosing the subject of the source sentence as the answer phrase)

• Vagueness (integer): the numbers of noun phrases in the question, source sentence, and answer phrase that are potentially vague
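A toy extractor for a few of the feature types listed above. The whitespace tokenization and the small pronoun list are simplifying assumptions; the paper's grammatical counts come from parse trees.

```python
NEGATION = {"not", "never", "no"}
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}

def features(question, source_sentence, answer_phrase):
    q = question.split()
    s = source_sentence.split()
    a = answer_phrase.split()
    return {
        "len_question": len(q),                                   # Length features
        "len_source": len(s),
        "len_answer": len(a),
        "negation": int(any(t.lower() in NEGATION for t in q)),   # Negation feature
        "pronouns_in_question": sum(t.lower() in PRONOUNS for t in q),  # Vagueness proxy
    }

print(features("Who did John see ?", "John saw Mary .", "Mary"))
```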

Page 25:

Evaluation

• We present the results of experiments evaluating the quality of generated questions before and after ranking.

• The evaluation metric we employ is the percentage of test set questions labeled as acceptable.

• For rankings, our metric is the percentage of the top N% labeled as acceptable, for various N.
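Concretely, the ranking metric can be computed as below (the labels must already be sorted by the ranker's score, best first; the example labels are hypothetical).

```python
def acceptability_at_top(ranked_labels, percent):
    """Percentage of the top `percent`% of ranked questions labeled acceptable."""
    k = max(1, int(len(ranked_labels) * percent / 100))
    top = ranked_labels[:k]
    return 100.0 * sum(top) / len(top)

labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # 1 = acceptable, already ranked
print(acceptability_at_top(labels, 20))    # 100.0
print(acceptability_at_top(labels, 100))   # 40.0
```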

Page 26:

Results for Unranked Questions

• 27.3% of test set questions were labeled acceptable (i.e., having no deficiencies) by a majority of raters.

Page 27:

Page 28:

Results for Ranking

Page 29:

Ablation Result

Page 30:

Recall