Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
By Chin-Yew Lin and Eduard Hovy

The Document Understanding Conference
In 2002 there were two main tasks:
• Summarization of single documents
• Summarization of multiple documents
DUC Single Document Summarization
Summarization of single documents
• Generate a 100-word summary
• Training: 30 sets of 10 docs, each with 100-word summaries
• Test against 30 unseen documents
DUC Multi-Document Summarization
Summarization of multiple documents about a single subject
• Generate 50-, 100-, 200-, and 400-word summaries
• Four types: single natural disaster, single event, multiple instances of a type of event, information about an individual
• Training: 30 sets of 10 documents with their 50-, 100-, 200-, and 400-word summaries
• Test: 30 unseen documents
DUC Evaluation Material
For each document set, one human summary was created to be the ‘ideal’ summary at each length.
Two additional human summaries were created at each length.
Baseline summaries were created automatically at each length as reference points:
• The lead baseline took the first n words of the last document (for the multi-doc task)
• The coverage baseline used the first sentence of each document until it reached its length
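The two baselines above are simple enough to sketch. A minimal illustration, assuming whitespace tokenization and a naive sentence split (the slides do not specify DUC's actual implementation):

```python
def lead_baseline(documents, n_words):
    """Lead baseline (multi-doc task): first n words of the last document."""
    return documents[-1].split()[:n_words]

def coverage_baseline(documents, n_words):
    """Coverage baseline: take the first sentence of each document in
    turn until the length budget is reached."""
    summary = []
    for doc in documents:
        # Naive '. '-based sentence split; a real segmenter would be used
        # in practice.
        first_sentence = doc.split(". ")[0]
        summary.extend(first_sentence.split())
        if len(summary) >= n_words:
            break
    return summary[:n_words]
```

Both functions return a truncated token list, matching the fixed word budgets (50, 100, 200, 400) used in the multi-document task.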
SEE: Summary Evaluation Environment
A tool that allows assessors to compare system text (the peer) with ideal text (the model). It can rank both quality and content.
• The assessor marks all system units sharing content with the model as {all, most, some, hardly any}
• The assessor rates the quality of grammaticality, cohesion, and coherence as {all, most, some, hardly any, none}
SEE interface

Making a Judgement
From Chin-Yew Lin / MT Summit IX, 2003-09-27
Evaluation Metrics
One idea is simple sentence recall, but it cannot differentiate system performance (it pays to be overproductive).
Recall is measured relative to the model text; E is the average of the coverage scores.
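The averaging of coverage scores can be sketched as follows. This is a hypothetical reconstruction: the slides do not give the numeric weights, so the mapping of the categorical judgements {all, most, some, hardly any} to 1, 3/4, 1/2, 1/4 is an assumption.

```python
# Assumed mapping of the SEE categorical judgements to fractional
# completeness weights (not specified on the slides).
COMPLETENESS = {"all": 1.0, "most": 0.75, "some": 0.5,
                "hardly any": 0.25, "none": 0.0}

def coverage_score(judgements):
    """Average the per-model-unit completeness judgements, one label
    per unit of the model summary."""
    if not judgements:
        return 0.0
    return sum(COMPLETENESS[j] for j in judgements) / len(judgements)

# A four-unit model summary as judged by an assessor in SEE:
print(coverage_score(["all", "most", "some", "none"]))  # 0.5625
```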
Machine Translation and Summarization Evaluation
Machine Translation
• Inputs: reference translation, candidate translation
• Methods:
  Manual comparison of the two translations in: accuracy, fluency, informativeness
  Automatic evaluation using: BLEU/NIST scores
Automatic Summarization
• Inputs: reference summary, candidate summary
• Methods:
  Manual comparison of the two summaries in: content overlap, linguistic qualities
  Automatic evaluation: ?
NIST BLEU
Goal: measure the translation closeness between a candidate translation and a set of reference translations with a numeric metric.
Method: use a weighted average of variable-length n-gram matches between the system translation and the set of human reference translations.
BLEU correlates highly with human assessments. We would like to make the same assumption for summarization: the closer a summary is to a professional summary, the better it is.
BLEU
• Is a promising automatic scoring metric for summary evaluation
• Is basically a precision metric
• Measures how well a source overlaps a model using n-gram co-occurrence statistics
• Uses a brevity penalty (BP) to prevent short translations from maximizing their precision score
• In the formulas, c = candidate length and r = reference length
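The brevity penalty and clipped n-gram precision can be sketched as follows. This is a simplified single-reference version: real BLEU pools clipped counts over multiple references and uses the closest reference length for r.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU sketch: brevity penalty times the geometric
    mean of clipped n-gram precisions for n = 1..max_n."""
    c, r = len(candidate), len(reference)
    # BP = 1 if c > r, else exp(1 - r/c): penalizes short candidates.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / sum(cand.values())))
    return bp * math.exp(sum(log_precisions) / max_n)

sent = "the gunman was shot dead by police".split()
print(bleu(sent, sent))  # identical texts score 1.0
```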
Anatomy of a BLEU Matching Score
From Chin-Yew Lin / MT Summit IX, 2003-09-27
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
From Chin-Yew Lin / MT Summit IX, 2003-09-27
What makes a good metric?
Automatic evaluation should correlate highly, positively, and consistently with human assessments.
• If a human recognizes a good system, so will the metric.
The statistical significance of automatic evaluations should be a good predictor, with high reliability, of the statistical significance of human assessments.
• The metric can then be used in place of humans to assist in system development.
ROUGE vs BLEU
ROUGE: recall based
• Separately evaluates 1-, 2-, 3-, and 4-grams
• No length penalty
• Verified for extraction summaries
• Focuses on content overlap
BLEU: precision based
• Mixed n-grams
• Uses a brevity penalty to penalize system translations that are shorter than the average reference length
• Favors longer n-grams, for grammaticality and word order
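The recall-based side of the contrast can be sketched directly: ROUGE-N is the fraction of reference n-grams that also appear in the candidate, with no length penalty. A minimal sketch, assuming pre-tokenized input:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-N as a recall metric: matched reference n-grams (counts
    clipped) divided by the total n-grams in the reference."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

ref = "the police killed the gunman".split()
cand = "the gunman was shot by police".split()
print(rouge_n(cand, ref, 1))  # 3 of 5 reference unigrams matched -> 0.6
```

Note the contrast with BLEU: the denominator counts reference n-grams (recall), not candidate n-grams (precision), so an overlong candidate gains nothing it does not actually match.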
By all measures
Findings
• Ngram(1,4) is a weighted variable-length n-gram match score similar to BLEU.
• Simple unigrams, Ngram(1,1), and bigrams, Ngram(2,2), consistently outperformed Ngram(1,4) in single- and multiple-document tasks when stopwords are ignored.
• Weighted-average n-gram scores fall between the bigram and trigram scores, suggesting summaries are over-penalized by the weighted average due to a lack of longer n-gram matches.
• Excluding stopwords when computing n-gram statistics generally achieves better correlation than including them.
• Ngram(1,1) and Ngram(2,2) are good automatic scoring metrics based on statistical predictive power.
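The slides describe Ngram(1,4) only as a weighted variable-length n-gram match similar to BLEU. One plausible reading, sketched under the assumption of a uniform geometric mean over the per-n recall scores (the paper's exact weighting may differ), which also makes Ngram(1,1) and Ngram(2,2) reduce to plain unigram and bigram recall:

```python
import math
from collections import Counter

def ngram_recall(candidate, reference, n):
    """Recall of reference n-grams found in the candidate."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

def ngram_score(candidate, reference, i, j):
    """Hypothetical Ngram(i, j): combine the n-gram recalls for n = i..j
    with a uniform geometric mean, analogous to how BLEU combines its
    n-gram precisions."""
    recalls = [ngram_recall(candidate, reference, n) for n in range(i, j + 1)]
    if min(recalls) == 0.0:
        return 0.0
    return math.exp(sum(math.log(r) for r in recalls) / len(recalls))
```

Under this reading, the finding that Ngram(1,4) sits between the bigram and trigram scores follows naturally: the geometric mean is dragged down by the sparse 3- and 4-gram matches.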