TRANSCRIPT
TDT 2004 Evaluation Workshop, NIST, December 2-3, 2004
Creating the TDT5 Corpus and 2004 Evaluation Topics at LDC
Stephanie Strassel, Meghan Glenn, Junbo Kong
Linguistic Data Consortium
{strassel, mlglenn, junbok}@ldc.upenn.edu
www.ldc.upenn.edu/Projects/TDT5
What’s new in TDT5?
- Same fundamental concepts: story, event, topic
- New multilingual corpus: much larger than previous corpora; newswire only
- New topic selection strategy, more topics: 250 topics, ~25% multilingual
- New topic labeling strategy: search-guided, but time-limited
- New annotation toolkit and infrastructure: multilingual, multiplatform, database-backed; highly customized for the TDT task
Basic Concepts

STORY
- In TDT2, a story is "a section containing at least two independent declarative clauses on the same topic"
- In TDT3, the definition was modified to capture annotators' intuitions about what constitutes a story, distinguishing "preview/teaser" segments from complete news stories
- TDT4 preserves this content-based story definition
- In TDT5 there is no manual story segmentation: newswire comes with story boundaries, so all documents are stories

EVENT
- A specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences

TOPIC
- An event or activity, along with all directly related events and activities
Corpus Overview
language  code  source                        # documents  on-topic docs  % on-topic
Arabic    AFA   Agence France Presse               30,590          2,640       8.63%
Arabic    ANN   An-Nahar                            8,162            246       3.01%
Arabic    UMM   Ummah Press                         1,103             18       1.63%
Arabic    XIA   Xinhua News Agency                 33,050          3,532      10.69%
English   AFE   Agence France Presse               95,427          3,992       4.18%
English   APE   Associated Press                  104,909          5,309       5.06%
English   CNE   Central News Agency - Taiwan        1,117             27       2.42%
English   LAT   LA Times/Washington Post            6,692            313       4.68%
English   NYT   New York Times                     12,021            515       4.28%
English   UME   Ummah Press                         1,099             10       0.91%
English   XIE   Xinhua News Agency                 56,794          1,675       2.95%
Mandarin  AFC   Agence France Presse                5,640            984      17.45%
Mandarin  CNA   Central News Agency - Taiwan        4,568            801      17.54%
Mandarin  XIN   Xinhua News Agency                 37,240          4,303      11.55%
Mandarin  ZBN   Zaobao News Agency                  9,011          1,888      20.95%
April – September 2003
Newswire only
Translations provided by ISI (thanks to Ignacio Thayer & Kevin Knight)
Distributed to sites in early September by LDC & NIST
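
As a quick sanity check, a few lines of Python (figures transcribed from the table above) recompute the per-language totals and on-topic rates; the resulting grand total of roughly 407,000 documents is also consistent with the next slide's note that the 30,239 seed stories are 7.42% of the corpus:

    # Recompute per-language totals and on-topic rates from the table above.
    sources = {
        "Arabic":   [("AFA", 30590, 2640), ("ANN", 8162, 246),
                     ("UMM", 1103, 18), ("XIA", 33050, 3532)],
        "English":  [("AFE", 95427, 3992), ("APE", 104909, 5309),
                     ("CNE", 1117, 27), ("LAT", 6692, 313),
                     ("NYT", 12021, 515), ("UME", 1099, 10),
                     ("XIE", 56794, 1675)],
        "Mandarin": [("AFC", 5640, 984), ("CNA", 4568, 801),
                     ("XIN", 37240, 4303), ("ZBN", 9011, 1888)],
    }
    total = 0
    for language, rows in sources.items():
        docs = sum(d for _, d, _ in rows)
        on_topic = sum(t for _, _, t in rows)
        total += docs
        print(f"{language:8s} {docs:7,d} docs, {on_topic:5,d} on-topic "
              f"({100 * on_topic / docs:.2f}%)")
    print(f"total    {total:7,d} docs")  # 407,423; 30,239 seeds = 7.42%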
Topic Selection Strategy
- Source/date-balanced "seed" story lists
  - 30,239 seeds generated (7.42% of the corpus); 12,415 reviewed
- Seeds that describe an event become candidate topics
  - 3,106 candidate topics identified
  - For all candidates, annotators record:
    • title, seminal event, who/what/when/where (this later feeds into the topic definition)
    • estimated topic size
    • multilingual potential
- Candidate topics reviewed for suitability as final topics
  - Exclude same-language exact duplicates
  - No avoidance of hierarchical or overlapping topics, but no extra effort to include them
  - Select a range of topic types and sizes; no avoidance of "singletons"
  - Also consider annotator preferences
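
The slides do not spell out how the source/date-balanced seed lists were drawn; the following is a minimal sketch of one plausible balancing scheme, assuming a hypothetical story record with "id", "source", and "date" fields (an illustration, not LDC's actual procedure):

    import random
    from collections import defaultdict

    def balanced_seed_list(stories, seeds_per_cell=5, seed=0):
        # Group story ids into (source, date) cells, then sample the same
        # number of seeds from each cell so no source or day dominates.
        # The record layout and per-cell quota are illustrative assumptions.
        rng = random.Random(seed)
        cells = defaultdict(list)
        for story in stories:
            cells[(story["source"], story["date"])].append(story["id"])
        seeds = []
        for ids in cells.values():
            seeds.extend(rng.sample(ids, min(seeds_per_cell, len(ids))))
        return seeds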
2004 Evaluation Topics
- 250 final topics selected from the candidates
- Equal balance across languages
[Chart: Topics by Language. Number of final topics (y-axis: 0-70) that are English, Chinese, Arabic, or multilingual, with multilingual topics broken out into English-Arabic, English-Chinese, and English-Arabic-Chinese.]
Topic Research
Completed for each evaluation topic:
- Annotator spends up to 1 hour per topic web searching for information
  - fills in missing details
  - provides context and scope
- Annotators specialize in particular topics (of their choosing)
- Annotator creates a topic profile that includes a brief narrative plus information like:
  - timelines
  - maps
  - keywords
  - named entities
  - links to other online resources
- Feeds directly into later annotation queries
Topic Explication
After topic research, the annotator provides a topic explication, applying a rule of interpretation to convert the event into a topic.
13 rules state, for each type of seminal event, what other types of events are related:
 1. Elections
 2. Scandals
 3. Legal/Criminal Cases
 4. Natural Disasters
 5. Accidents
 6. Acts of Violence/War
 7. Science/Discovery News
 8. Financial News
 9. New Laws
10. Sports News
11. Political/Diplomatic Meetings
12. Celebrity/Human Interest
13. Miscellaneous

Example: 4. Natural Disasters (e.g., topic 30002: Hurricane Mitch)
- Seminal events include: weather events (El Niño, tornadoes, hurricanes, floods, droughts), other natural events like volcanic eruptions, wildfires, famines and the like, rescue efforts, and coverage of the economic or human impact of the disaster.
- Topic includes: the causal (weather/natural) activity including predictions thereof, the disaster itself, victims and other losses, and evacuations and rescue/relief efforts.
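
To make the rule structure concrete, here is one possible encoding of a rule of interpretation as data; the schema itself is an assumed illustration, not LDC's actual data structure, and only the Rule 4 content is taken from the slide:

    # Hypothetical encoding of a rule of interpretation.
    RULES_OF_INTERPRETATION = {
        4: {
            "name": "Natural Disasters",
            "example_topic": "30002: Hurricane Mitch",
            "seminal_events": [
                "weather events (El Niño, tornadoes, hurricanes, floods, droughts)",
                "other natural events (volcanic eruptions, wildfires, famines)",
                "rescue efforts",
                "coverage of economic or human impact of the disaster",
            ],
            "topic_includes": [
                "the causal weather/natural activity, including predictions",
                "the disaster itself",
                "victims and other losses",
                "evacuations and rescue/relief efforts",
            ],
        },
    }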
Topic Definition
After topic research and topic explication are complete, the annotator creates the final topic definition:
- Fixed format to enhance consistency
- Seminal event: who/what/when/where
- Topic explication
- Rule of interpretation link
- Topic research link
- Seed story link
- Feeds directly into topic annotation
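
A minimal sketch of what such a fixed-format record might look like, assuming hypothetical field names derived from the bullets above (not the actual schema of LDC's annotation database):

    from dataclasses import dataclass, field

    @dataclass
    class TopicDefinition:
        # Field names are illustrative assumptions based on the slide.
        topic_id: str
        title: str
        seminal_event: str            # who/what/when/where summary
        explication: str              # result of applying the rule of interpretation
        rule_of_interpretation: int   # link to one of the 13 rules
        topic_research_link: str      # pointer to the topic profile
        seed_story_id: str            # link to the seed story
        languages: list[str] = field(default_factory=list)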
Annotation Strategy Overview
Search-guided annotation:
- One topic at a time
- Multiple stages for each topic
- Two-way topic labeling decision
- Time-limited: no more than 3 hours per topic
  - Annotation may be incomplete for a given topic

Relevance labels:
- YES: story discusses the topic in a substantial way
- NO: story does not discuss the topic at all, or only mentions the topic in passing without giving any information about it
- No BRIEF label in TDT4 or TDT5
- "Difficult Decision" label for tricky decisions

Completeness judgment:
- Each topic also marked "complete" or "incomplete" at the conclusion of annotation
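
The two-way labeling scheme and its auxiliary flags could be represented as below; the class and field names are assumptions for illustration only:

    from dataclasses import dataclass
    from enum import Enum

    class Relevance(Enum):
        YES = "yes"  # story discusses the topic in a substantial way
        NO = "no"    # off-topic, or only a passing mention

    @dataclass
    class Judgment:
        story_id: str
        topic_id: str
        label: Relevance
        difficult: bool = False  # "Difficult Decision" flag for tricky calls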
Annotation Search Stages

Stage 1: Initial query (60 minutes)
- Submit the seed story as a query to the search engine
- Read through the resulting relevance-ranked list of 200 documents
- Label each story as YES/NO
- Stop after finding 5-10 on-topic stories, or after reaching the "off-topic threshold" (sketched in code below):
  - at least 2 off-topic stories for every 1 on-topic story read, AND
  - the last 10 consecutive stories are off-topic

Stage 2: Topic profile-based queries (45 minutes)
- Issue a new query drawn from text within the topic research & topic definition
- Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold

Stage 3: Improved query using stories from Stages 1-2 (45 minutes)
- Issue a new query using a concatenation of all or some known on-topic stories
- Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold

Stage 4: Creative searching (30 minutes)
- Free (iterative) text query
- Annotators instructed to use specialized knowledge and think creatively to find novel ways to identify additional on-topic stories
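
A minimal sketch of the off-topic stopping rule, assuming the judgments so far are available as a list of booleans in reading order (True = on-topic); this is a reconstruction from the bullets above, not code from the LDC toolkit:

    def off_topic_threshold_reached(judgments, window=10, off_per_on=2):
        # judgments: labels so far in reading order; True = YES (on-topic).
        # Threshold per the slide: at least `off_per_on` off-topic stories
        # for every on-topic story read, AND the last `window` consecutive
        # stories all off-topic.
        on = sum(judgments)
        off = len(judgments) - on
        ratio_met = off >= off_per_on * max(on, 1)
        tail_all_off = len(judgments) >= window and not any(judgments[-window:])
        return ratio_met and tail_all_off

In Stage 1 the annotator may also stop early after finding 5-10 on-topic stories; that alternative exit is an annotator judgment call and is not modeled here.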
“Hits” by Query Type
[Chart: percentage of on-topic "hits" by query type (x-axis: seed, profile, on-topic query 1, query 2, query 3; y-axis: 0-100%), broken down by language (Arabic, English, Mandarin).]
Annotation Time per Topic
[Chart: histogram of the number of topics (y-axis: 0-120) by annotation time in hours (bins: under 1, 1-1.5, 1.5-2, 2-2.5, 2.5-3, 3+).]
Annotation Time, Topic Completeness & Topic Size

[Chart: by annotation-time bin (under 1 to 3+ hours), the percentage of complete topics (left axis: 0-100%) and the average number of documents labeled per topic (right axis: 0-500).]
Additional Annotation & QC

Top-Ranked Off-Topic Stories (TROTS)
- By community consensus, not provided in 2004

Precision
- All on-topic (YES) stories reviewed by a senior annotator to identify false alarms
- All "not easy" off-topic stories reviewed

Adjudication
- Review pooled site results and adjudicate cases of disagreement with LDC annotators' judgments
- Pooled 4 sites' tracking results
- Reviewed all purported LDC false alarms
- Reviewed a portion of purported LDC misses
- Priorities (sketched in code below):
  • 4/4 sites disagree with LDC
  • 3/4 sites disagree with LDC
  • incomplete topics
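
A minimal sketch of the adjudication triage implied by these priorities, assuming a hypothetical case record with a count of disagreeing sites and a topic-completeness flag (the record layout is an assumption, not LDC's actual pipeline):

    def adjudication_priority(case):
        # Lower values sort first. Field names are illustrative assumptions:
        # "sites_disagreeing" counts how many of the 4 pooled sites disagree
        # with the LDC judgment; "topic_complete" is the completeness flag.
        if case["sites_disagreeing"] == 4:
            return 0
        if case["sites_disagreeing"] == 3:
            return 1
        if not case["topic_complete"]:
            return 2
        return 3

    # Usage: cases.sort(key=adjudication_priority)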
Topic Size and Adjudication Changes

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against topic size, i.e. the number of documents judged (x-axis: 0-2000).]
Topic Hits and Adjudication Changes

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against the number of on-topic stories (x-axis: 0-400).]
Adjudication & Difficult Topics

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against the percentage of judgments marked "difficult" (x-axis: 0-35%).]
[Subsequent builds of the same chart label the high-change outlier topics: 55125-E "Sweden rejects Euro", 55106-E "Bombing in Riyadh", and 55200-E "Iraq Antiquities". These outliers share traits: many on-topic stories, overlap with other topics, and terrorism or Middle East subject matter.]
TDT Annotation Toolkit
[Screenshots of the annotation toolkit, with callouts: Choose Task; Topic Selection; Topic Research, Definition; Topic Labeling, Go to Next Stage; Free Text Query; Topic Complete?]