TRANSCRIPT
TDT 2004 Evaluation Workshop, NIST, December 2-3, 2004
Creating the TDT5 Corpus and 2004 Evaluation Topics at LDC
Stephanie Strassel, Meghan Glenn, Junbo Kong
Linguistic Data Consortium
{strassel, mlglenn, junbok}@ldc.upenn.edu
www.ldc.upenn.edu/Projects/TDT5
What’s new in TDT5?
- Same fundamental concepts: story, event, topic
- New multilingual corpus: much larger than previous corpora; newswire only
- New topic selection strategy, more topics: 250 topics, ~25% multilingual
- New topic labeling strategy: search-guided, but time-limited
- New annotation toolkit and infrastructure: multilingual, multiplatform, database-backed; highly customized for the TDT task
Basic Concepts

STORY
- In TDT2, a story is "a section containing at least two independent declarative clauses on the same topic"
- In TDT3, the definition was modified to capture annotators' intuitions about what constitutes a story, distinguishing "preview/teaser" segments from complete news stories
- TDT4 preserves this content-based story definition
- In TDT5 there is no manual story segmentation: newswire comes with story boundaries, so all documents are stories

EVENT
- A specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences

TOPIC
- An event or activity, along with all directly related events and activities
Corpus Overview
language  code  source                        # documents  on-topic docs  % on-topic
Arabic    AFA   Agence France Presse               30,590          2,640       8.63%
Arabic    ANN   An-Nahar                            8,162            246       3.01%
Arabic    UMM   Ummah Press                         1,103             18       1.63%
Arabic    XIA   Xinhua News Agency                 33,050          3,532      10.69%
English   AFE   Agence France Presse               95,427          3,992       4.18%
English   APE   Associated Press                  104,909          5,309       5.06%
English   CNE   Central News Agency - Taiwan        1,117             27       2.42%
English   LAT   LA Times/Washington Post            6,692            313       4.68%
English   NYT   New York Times                     12,021            515       4.28%
English   UME   Ummah Press                         1,099             10       0.91%
English   XIE   Xinhua News Agency                 56,794          1,675       2.95%
Mandarin  AFC   Agence France Presse                5,640            984      17.45%
Mandarin  CNA   Central News Agency - Taiwan        4,568            801      17.54%
Mandarin  XIN   Xinhua News Agency                 37,240          4,303      11.55%
Mandarin  ZBN   Zaobao News Agency                  9,011          1,888      20.95%
April – September 2003
Newswire only
Translations provided by ISI (thanks to Ignacio Thayer & Kevin Knight)
Distributed to sites in early September by LDC & NIST
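
As a quick sanity check, a few lines of Python (figures transcribed from the table above) recompute the per-language totals and on-topic rates; the resulting grand total of roughly 407,000 documents is also consistent with the next slide's note that the 30,239 seed stories are 7.42% of the corpus:

    # Recompute per-language totals and on-topic rates from the table above.
    sources = {
        "Arabic":   [("AFA", 30590, 2640), ("ANN", 8162, 246),
                     ("UMM", 1103, 18), ("XIA", 33050, 3532)],
        "English":  [("AFE", 95427, 3992), ("APE", 104909, 5309),
                     ("CNE", 1117, 27), ("LAT", 6692, 313),
                     ("NYT", 12021, 515), ("UME", 1099, 10),
                     ("XIE", 56794, 1675)],
        "Mandarin": [("AFC", 5640, 984), ("CNA", 4568, 801),
                     ("XIN", 37240, 4303), ("ZBN", 9011, 1888)],
    }
    total = 0
    for language, rows in sources.items():
        docs = sum(d for _, d, _ in rows)
        on_topic = sum(t for _, _, t in rows)
        total += docs
        print(f"{language:8s} {docs:7,d} docs, {on_topic:5,d} on-topic "
              f"({100 * on_topic / docs:.2f}%)")
    print(f"total    {total:7,d} docs")  # 407,423; 30,239 seeds = 7.42%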
Topic Selection Strategy
- Source/date-balanced "seed" story lists
  - 30,239 seeds generated (7.42% of the corpus); 12,415 reviewed
- Seeds that describe an event become candidate topics
  - 3,106 candidate topics identified
  - For all candidates, annotators record:
    • title, seminal event, who/what/when/where (this later feeds into the topic definition)
    • estimated topic size
    • multilingual potential
- Candidate topics reviewed for suitability as final topics
  - Exclude same-language exact duplicates
  - No avoidance of hierarchical or overlapping topics, but no extra effort to include them
  - Select a range of topic types and sizes; no avoidance of "singletons"
  - Also consider annotator preferences
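
The slides do not spell out how the source/date-balanced seed lists were drawn; the following is a minimal sketch of one plausible balancing scheme, assuming a hypothetical story record with "id", "source", and "date" fields (an illustration, not LDC's actual procedure):

    import random
    from collections import defaultdict

    def balanced_seed_list(stories, seeds_per_cell=5, seed=0):
        # Group story ids into (source, date) cells, then sample the same
        # number of seeds from each cell so no source or day dominates.
        # The record layout and per-cell quota are illustrative assumptions.
        rng = random.Random(seed)
        cells = defaultdict(list)
        for story in stories:
            cells[(story["source"], story["date"])].append(story["id"])
        seeds = []
        for ids in cells.values():
            seeds.extend(rng.sample(ids, min(seeds_per_cell, len(ids))))
        return seeds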
2004 Evaluation Topics
- 250 final topics selected from the candidates
- Equal balance across languages
[Chart: Topics by Language. Number of final topics (y-axis: 0-70) that are English, Chinese, Arabic, or multilingual, with multilingual topics broken out into English-Arabic, English-Chinese, and English-Arabic-Chinese.]
Topic Research
Completed for each evaluation topic:
- Annotator spends up to 1 hour per topic web searching for information
  - fills in missing details
  - provides context and scope
- Annotators specialize in particular topics (of their choosing)
- Annotator creates a topic profile that includes a brief narrative plus information like:
  - timelines
  - maps
  - keywords
  - named entities
  - links to other online resources
- Feeds directly into later annotation queries
Topic Explication
After topic research, the annotator provides a topic explication, applying a rule of interpretation to convert the event into a topic.
13 rules state, for each type of seminal event, what other types of events are related:
 1. Elections
 2. Scandals
 3. Legal/Criminal Cases
 4. Natural Disasters
 5. Accidents
 6. Acts of Violence/War
 7. Science/Discovery News
 8. Financial News
 9. New Laws
10. Sports News
11. Political/Diplomatic Meetings
12. Celebrity/Human Interest
13. Miscellaneous

Example: 4. Natural Disasters (e.g., topic 30002: Hurricane Mitch)
- Seminal events include: weather events (El Niño, tornadoes, hurricanes, floods, droughts), other natural events like volcanic eruptions, wildfires, famines and the like, rescue efforts, and coverage of the economic or human impact of the disaster.
- Topic includes: the causal (weather/natural) activity including predictions thereof, the disaster itself, victims and other losses, and evacuations and rescue/relief efforts.
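
To make the rule structure concrete, here is one possible encoding of a rule of interpretation as data; the schema itself is an assumed illustration, not LDC's actual data structure, and only the Rule 4 content is taken from the slide:

    # Hypothetical encoding of a rule of interpretation.
    RULES_OF_INTERPRETATION = {
        4: {
            "name": "Natural Disasters",
            "example_topic": "30002: Hurricane Mitch",
            "seminal_events": [
                "weather events (El Niño, tornadoes, hurricanes, floods, droughts)",
                "other natural events (volcanic eruptions, wildfires, famines)",
                "rescue efforts",
                "coverage of economic or human impact of the disaster",
            ],
            "topic_includes": [
                "the causal weather/natural activity, including predictions",
                "the disaster itself",
                "victims and other losses",
                "evacuations and rescue/relief efforts",
            ],
        },
    }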
Topic Definition
After topic research and topic explication are complete, the annotator creates the final topic definition:
- Fixed format to enhance consistency
- Seminal event: who/what/when/where
- Topic explication
- Rule of interpretation link
- Topic research link
- Seed story link
- Feeds directly into topic annotation
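
A minimal sketch of what such a fixed-format record might look like, assuming hypothetical field names derived from the bullets above (not the actual schema of LDC's annotation database):

    from dataclasses import dataclass, field

    @dataclass
    class TopicDefinition:
        # Field names are illustrative assumptions based on the slide.
        topic_id: str
        title: str
        seminal_event: str            # who/what/when/where summary
        explication: str              # result of applying the rule of interpretation
        rule_of_interpretation: int   # link to one of the 13 rules
        topic_research_link: str      # pointer to the topic profile
        seed_story_id: str            # link to the seed story
        languages: list[str] = field(default_factory=list)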
Annotation Strategy Overview
Search-guided annotation:
- One topic at a time
- Multiple stages for each topic
- Two-way topic labeling decision
- Time-limited: no more than 3 hours per topic
  - Annotation may be incomplete for a given topic

Relevance labels:
- YES: story discusses the topic in a substantial way
- NO: story does not discuss the topic at all, or only mentions the topic in passing without giving any information about it
- No BRIEF label in TDT4 or TDT5
- "Difficult Decision" label for tricky decisions

Completeness judgment:
- Each topic also marked "complete" or "incomplete" at the conclusion of annotation
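
The two-way labeling scheme and its auxiliary flags could be represented as below; the class and field names are assumptions for illustration only:

    from dataclasses import dataclass
    from enum import Enum

    class Relevance(Enum):
        YES = "yes"  # story discusses the topic in a substantial way
        NO = "no"    # off-topic, or only a passing mention

    @dataclass
    class Judgment:
        story_id: str
        topic_id: str
        label: Relevance
        difficult: bool = False  # "Difficult Decision" flag for tricky calls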
Annotation Search Stages

Stage 1: Initial query (60 minutes)
- Submit the seed story as a query to the search engine
- Read through the resulting relevance-ranked list of 200 documents
- Label each story as YES/NO
- Stop after finding 5-10 on-topic stories, or after reaching the "off-topic threshold" (sketched in code below):
  - at least 2 off-topic stories for every 1 on-topic story read, AND
  - the last 10 consecutive stories are off-topic

Stage 2: Topic profile-based queries (45 minutes)
- Issue a new query drawn from text within the topic research & topic definition
- Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold

Stage 3: Improved query using stories from Stages 1-2 (45 minutes)
- Issue a new query using a concatenation of all or some known on-topic stories
- Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold

Stage 4: Creative searching (30 minutes)
- Free (iterative) text query
- Annotators instructed to use specialized knowledge and think creatively to find novel ways to identify additional on-topic stories
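
A minimal sketch of the off-topic stopping rule, assuming the judgments so far are available as a list of booleans in reading order (True = on-topic); this is a reconstruction from the bullets above, not code from the LDC toolkit:

    def off_topic_threshold_reached(judgments, window=10, off_per_on=2):
        # judgments: labels so far in reading order; True = YES (on-topic).
        # Threshold per the slide: at least `off_per_on` off-topic stories
        # for every on-topic story read, AND the last `window` consecutive
        # stories all off-topic.
        on = sum(judgments)
        off = len(judgments) - on
        ratio_met = off >= off_per_on * max(on, 1)
        tail_all_off = len(judgments) >= window and not any(judgments[-window:])
        return ratio_met and tail_all_off

In Stage 1 the annotator may also stop early after finding 5-10 on-topic stories; that alternative exit is an annotator judgment call and is not modeled here.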
“Hits” by Query Type
[Chart: percentage of on-topic "hits" by query type (x-axis: seed, profile, on-topic query 1, query 2, query 3; y-axis: 0-100%), broken down by language (Arabic, English, Mandarin).]
Annotation Time per Topic
[Chart: histogram of the number of topics (y-axis: 0-120) by annotation time in hours (bins: under 1, 1-1.5, 1.5-2, 2-2.5, 2.5-3, 3+).]
Annotation Time, Topic Completeness & Topic Size

[Chart: by annotation-time bin (under 1 to 3+ hours), the percentage of complete topics (left axis: 0-100%) and the average number of documents labeled per topic (right axis: 0-500).]
Additional Annotation & QC

Top-Ranked Off-Topic Stories (TROTS)
- By community consensus, not provided in 2004

Precision
- All on-topic (YES) stories reviewed by a senior annotator to identify false alarms
- All "not easy" off-topic stories reviewed

Adjudication
- Review pooled site results and adjudicate cases of disagreement with LDC annotators' judgments
- Pooled 4 sites' tracking results
- Reviewed all purported LDC false alarms
- Reviewed a portion of purported LDC misses
- Priorities (sketched in code below):
  • 4/4 sites disagree with LDC
  • 3/4 sites disagree with LDC
  • incomplete topics
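
A minimal sketch of the adjudication triage implied by these priorities, assuming a hypothetical case record with a count of disagreeing sites and a topic-completeness flag (the record layout is an assumption, not LDC's actual pipeline):

    def adjudication_priority(case):
        # Lower values sort first. Field names are illustrative assumptions:
        # "sites_disagreeing" counts how many of the 4 pooled sites disagree
        # with the LDC judgment; "topic_complete" is the completeness flag.
        if case["sites_disagreeing"] == 4:
            return 0
        if case["sites_disagreeing"] == 3:
            return 1
        if not case["topic_complete"]:
            return 2
        return 3

    # Usage: cases.sort(key=adjudication_priority)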
Topic Size and Adjudication Changes

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against topic size, i.e. the number of documents judged (x-axis: 0-2000).]
Topic Hits and Adjudication Changes

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against the number of on-topic stories (x-axis: 0-400).]
Adjudication & Difficult Topics

[Chart: scatter of the percentage of judgments (YES & NO) changed during adjudication (y-axis: 0-16%) against the percentage of judgments marked "difficult" (x-axis: 0-35%).]
[Subsequent builds of the same chart label the high-change outlier topics: 55125-E "Sweden rejects Euro", 55106-E "Bombing in Riyadh", and 55200-E "Iraq Antiquities". These outliers share traits: many on-topic stories, overlap with other topics, and terrorism or Middle East subject matter.]
TDT Annotation Toolkit
[Screenshots of the annotation toolkit, with callouts: Choose Task; Topic Selection; Topic Research, Definition; Topic Labeling, Go to Next Stage; Free Text Query; Topic Complete?]