
Page 1: Topics Detection and Tracking

Topics Detection and Tracking

Presented by CHU Huei-Ming 2004/03/17

Page 2: Topics Detection and Tracking

References

• Pattern Recognition in Speech and Language Processing, Chap. 12: "Modeling Topics for Detection and Tracking"
  – James Allan, University of Massachusetts Amherst
  – Publisher: CRC Press, published 2003/02

• UMass at TDT 2004
  – Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan, University of Massachusetts Amherst
  – TDT 2004 workshop

Page 3: Topics Detection and Tracking


Topic Detection and Tracking (1/6)

• The goal of TDT research is to organize news stories by the events that they describe.

• The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts and DARPA

• To find out how well classic IR technologies addressed TDT, they created a small collection of news stories and identified some topics within them

Page 4: Topics Detection and Tracking

Topic Detection and Tracking (2/6)

• Event
  – Something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences

• Topic
  – Captures the larger set of happenings that are related to some triggering event
  – By forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news

Page 5: Topics Detection and Tracking

Topic Detection and Tracking (3/6)

• TDT Tasks
  – Segmentation
    • Break an audio track into discrete stories, each on a single topic
  – Cluster Detection (Detection)
    • Place all arriving news stories into groups based on their topics
    • If a story fits no existing group, the system must decide whether to create a new topic
    • Each story is placed in precisely one cluster
  – Tracking
    • Starts with a small set of news stories that a user has identified as being on the same topic
    • The system must monitor the stream of arriving news to find all additional stories on the same topic

Page 6: Topics Detection and Tracking

Topic Detection and Tracking (4/6)

  – New Event Detection (first story detection)
    • Focuses on the cluster-creation aspect of cluster detection
    • Evaluated on its ability to decide when a new topic (event) appears
  – Link Detection
    • Determine whether or not two randomly presented stories discuss the same topic
    • The solution to this task could be used to solve new event detection

Page 7: Topics Detection and Tracking

Topic Detection and Tracking (5/6)

• Corpora
  – TDT-2: in 2002 it is being augmented with some Arabic news from the same time period
  – TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002

Stage | Source | Number of stories | Topics | Duration
Pilot study | CNN, Reuters | 16,000 | 25 | 1994/7-12, 1995/1-6
TDT-2 | six English sources and three Chinese sources | 80,000 | 100 | 1998/1-6
TDT-3 | eight English sources and three Chinese sources | 40,000 | 120 | 1998/10-12
TDT-4 | eight English sources, three Chinese sources, and four Arabic sources | 45,000 | 60 | 2000/10-12, 2001/1

Page 8: Topics Detection and Tracking

Topic Detection and Tracking (6/6)

• Evaluation
  – P(target) is the prior probability that a story will be on topic
  – C_miss and C_fa are user-specified values that reflect the cost associated with each kind of error
  – P(miss) and P(fa) are the actual system error rates
  – Within TDT evaluations, C_miss = 10 and C_fa = 1
  – P(target) = 1 - P(off-target) = 0.02 (derived from training data)

$\text{Cost} = C_{miss} \cdot P(miss) \cdot P(target) + C_{fa} \cdot P(fa) \cdot P(\text{off-target})$
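Since every quantity above is either a fixed constant or a measured rate, the cost is a one-line computation. A minimal Python sketch (function and variable names are ours, not from the deck):

```python
C_MISS, C_FA = 10.0, 1.0      # TDT-specified error costs
P_TARGET = 0.02               # prior that a story is on topic
P_OFF_TARGET = 1.0 - P_TARGET

def tdt_cost(p_miss: float, p_fa: float) -> float:
    """Combine measured system error rates into the single TDT cost figure."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * P_OFF_TARGET

# Example: missing 10% of on-topic stories and false-alarming on 1%
print(tdt_cost(0.10, 0.01))   # 10*0.10*0.02 + 1*0.01*0.98 = 0.0298
```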

Page 9: Topics Detection and Tracking

Basic Topic Model

• Vector Space
  – Represent items (stories or topics) as vectors in a high-dimensional space
  – The most common comparison function is the cosine of the angle between the two vectors

• Language Models
  – A topic is represented as a probability distribution over words
  – The initial probability estimates come from the maximum-likelihood estimate based on the document

• Use of the topic model
  – See how likely it is that a particular story could be generated by the model
  – Compare models directly: a symmetric version of the Kullback-Leibler divergence

$\text{sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|_2 \, \|\vec{v}\|_2}$  (cosine comparison in the vector space model)

$P_{ml}(w) = \frac{tf_w}{\sum_t tf_t}$  (maximum-likelihood estimate)

$P(\text{story} \mid M) = \prod_{w \in \text{story}} P(w \mid M)$  (likelihood of a story under topic model M)

$D(M_1 \| M_2) + D(M_2 \| M_1)$  (symmetric Kullback-Leibler comparison)
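Both representations are small enough to sketch end to end. A minimal Python version of the formulas above (whitespace tokenization and the epsilon floor are simplifying assumptions of ours, not the chapter's method):

```python
import math
from collections import Counter

def ml_unigram(text: str) -> dict:
    """Maximum-likelihood unigram model: P(w) = tf_w / sum_t tf_t."""
    tf = Counter(text.split())
    total = sum(tf.values())
    return {w: c / total for w, c in tf.items()}

def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * v.get(w, 0.0) for w, count in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def log_story_likelihood(story: str, model: dict, eps: float = 1e-9) -> float:
    """log P(story | M) = sum over story words of log P(w | M)."""
    return sum(math.log(model.get(w, eps)) for w in story.split())
```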

Page 10: Topics Detection and Tracking

Implementing the Models (1/3)

• Named Entities
  – News is usually about people, so it seems reasonable that their names could be treated specially
  – Treat the named entities as a separate part of the model and then merge the parts
  – Boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common
  – Improves the results slightly; no strong effect so far

Page 11: Topics Detection and Tracking

Implementing the Models (2/3)

• Document Expansion
  – In the segmentation task, a possible segmentation boundary could be checked by comparing the models generated by the text on either side
  – The text could be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories could be used for the comparison
  – Relevance models result in substantial improvements in the link detection task

Page 12: Topics Detection and Tracking

Implementing the Models (3/3)

• Time Decay
  – The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time
  – In a vector space model, the cosine similarity function can be changed so that it includes a time decay, as sketched below
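The chapter does not fix the form of the decay; one plausible choice is an exponential fall-off in the time gap. A sketch under that assumption (the half-life parameter is illustrative, not a TDT-specified value):

```python
def decayed_similarity(cos_sim: float, t1_days: float, t2_days: float,
                       half_life_days: float = 30.0) -> float:
    """Scale a cosine score down exponentially in the gap between story times."""
    gap = abs(t1_days - t2_days)
    return cos_sim * 0.5 ** (gap / half_life_days)
```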

Page 13: Topics Detection and Tracking

Comparing Models (1/3)

• Nearest Neighbors
  – In the vector space model, a topic might be represented as a single vector
  – To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector
  – If it falls outside the specified distance, the story is likely to be the seed of a new topic and a new vector can be formed, as in the sketch below
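A minimal version of that decision rule, reusing the `cosine` helper sketched earlier (the 0.3 threshold is borrowed from the UMass baseline later in the deck and is illustrative here):

```python
def assign_story(story_vec: dict, topic_vectors: dict, threshold: float = 0.3):
    """Attach a story to its nearest topic, or seed a new topic with it."""
    best_topic, best_sim = None, -1.0
    for topic_id, vec in topic_vectors.items():
        sim = cosine(story_vec, vec)
        if sim > best_sim:
            best_topic, best_sim = topic_id, sim
    if best_topic is not None and best_sim >= threshold:
        return best_topic                  # close enough: an existing topic
    new_id = len(topic_vectors)            # too far from everything: new topic
    topic_vectors[new_id] = dict(story_vec)
    return new_id
```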

Page 14: Topics Detection and Tracking

Comparing Models (2/3)

• Decision Trees
  – The best place for decision trees within TDT may be the segmentation task
  – There are numerous training instances (hand-segmented stories)
  – Finding features that are indicative of a story boundary is possible and achieves good quality

Page 15: Topics Detection and Tracking

Comparing Models (3/3)

• Model-to-Model
  – Direct comparison of the statistical language models that represent topics
  – Kullback-Leibler divergence
  – To finesse the measure, calculate it both ways and add the results together
  – One approach that has been used to incorporate that notion penalizes the comparison if the models are too much like background news

$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

$D(M_1 \| M_2) + D(M_2 \| M_1)$

$D(M_1 \| M_2) - D(M_1 \| M_{\text{news}})$
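All three quantities are direct to code over the unigram models built earlier (the epsilon floor again stands in for whatever smoothing a real system would use):

```python
import math

def kl(p: dict, q: dict, eps: float = 1e-9) -> float:
    """D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / q.get(x, eps)) for x, px in p.items() if px > 0)

def symmetric_kl(m1: dict, m2: dict) -> float:
    """Calculate the divergence both ways and add the results."""
    return kl(m1, m2) + kl(m2, m1)

def news_penalized(m1: dict, m2: dict, m_news: dict) -> float:
    """Discount pairs whose similarity is explained by background news."""
    return kl(m1, m2) - kl(m1, m_news)
```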

Page 16: Topics Detection and Tracking

Miscellaneous Issues (1/3)

• Deferral
  – All of the tasks are envisioned as "on-line" tasks
  – The decision about a story is expected before the next story is presented
  – In fact, TDT provides a moderate amount of look-ahead for the tasks
  – First, stories are always presented to the system grouped into "files" that correspond to about a half hour of news
  – Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed

Page 17: Topics Detection and Tracking

Miscellaneous Issues (2/3)

• Multi-modal Issues
  – The stories TDT systems must deal with are either written text (newswire) or read text (audio)
  – Speech recognizers make numerous mistakes: inserting, deleting, and even completely transforming words into other words
  – The difference between the two modes shows up in score normalization
  – For pairs of stories drawn from different sources the score distributions differ, so for the scores to be comparable a system needs to normalize depending on those modes

Page 18: Topics Detection and Tracking

Miscellaneous Issues (3/3)

• Multi-lingual Issues
  – The TDT research program has a strong interest in evaluating the tasks across multiple languages
  – From 1999 to 2001, sites were required to handle English and Chinese news stories
  – In 2002, sites will be incorporating Arabic as a third language

Page 19: Topics Detection and Tracking

Using TDT Interactively (1/2)

• Demonstrations
  – Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly

Page 20: Topics Detection and Tracking

Using TDT Interactively (2/2)

• Timelines
  – Use a timeline to show not only what the topics are, but how they occur in time
  – Use a χ² measure to determine whether or not a feature is occurring on a given day in an unusual way; a minimal version is sketched below
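The deck does not spell out the contingency table, but a natural reading is a 2x2 test of "this feature on this day" against "this feature on other days". A sketch under that assumption:

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], where
    a = feature count on the day,        b = feature count on other days,
    c = other-feature count on the day,  d = other-feature count on other days.
    A large value flags the feature as unusually active that day."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0
```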

Page 21: Topics Detection and Tracking


UMass at TDT 2004

• Hierarchical Topic Detection

• Topic Tracking

• New Event Detection

• Link Detection

Page 22: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (1/8)

• This task replaces Topic Detection in previous TDT evaluations
• Used the vector space model as the baseline
• Bounded clustering to reduce the time complexity, plus some simple parameter tuning
• Stories in the same event tend to be close in time, so we only need to compare a story to its "local" stories instead of the whole collection
• Two steps
  – Bounded 1-NN for event formation
  – Bounded agglomerative clustering for building the hierarchy

Page 23: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (2/8)

• Bounded 1-NN for event formation
  – All stories in the same original language and from the same source are taken out and time-ordered
  – Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it
  – If the similarity between the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created (see the sketch below)
  – There is a list of events for each source/language class
  – The events within each class are sorted by time according to the time stamp of the first story
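A compact version of that event-formation pass, reusing the `cosine` helper from earlier; the input is assumed to be the time-ordered list of story vectors for one source/language class:

```python
def bounded_1nn_events(stories: list, nstory: int = 100,
                       threshold: float = 0.3) -> list:
    """Return an event id for each story, comparing each incoming story
    only to the NSTORY stories immediately before it."""
    event_of = []          # event_of[i] is the event id of stories[i]
    next_event = 0
    for i, vec in enumerate(stories):
        best_j, best_sim = None, -1.0
        for j in range(max(0, i - nstory), i):
            sim = cosine(vec, stories[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim > threshold:
            event_of.append(event_of[best_j])   # join the neighbour's event
        else:
            event_of.append(next_event)         # seed a new event
            next_event += 1
    return event_of
```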

Page 24: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (3/8)

• Bounded 1-NN for event formation

[Figure: time-ordered story streams (S1, S2, S3) shown separately for Language A and Language B, each grouped into events]

Page 25: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (4/8)

• Each source is segmented into several parts, and these are sorted by time according to the time stamp of the first story
• Sorted event list

Page 26: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (5/8)

• Bounded agglomerative clustering for building the hierarchy
  – Take a certain number of events (the number, called WSIZE, defaults to 120) from the sorted event list
  – At each iteration, find the closest event pair and combine the later event into the earlier one

Page 27: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (6/8)

• Each iteration finds the closest event pair and combines the later event into the earlier one

[Figure: a window of time-ordered events I1, I2, I3, ..., Ir-1, Ir being merged pairwise]

Page 28: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (7/8)

• Bounded agglomerative clustering for building the hierarchy
  – Continue for (BRANCH-1) × WSIZE / BRANCH iterations, so the number of clusters left is WSIZE / BRANCH
  – Take the first half out, pull in WSIZE/2 new events, and agglomeratively cluster again until WSIZE/BRANCH clusters are left
  – The optimal value is around 3; BRANCH = 3 is the baseline (one window of the procedure is sketched below)
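One window of that procedure, sketched under simplifying assumptions of ours: centroids are held as `collections.Counter` vectors, merging simply sums them, and `cosine` is the helper from earlier:

```python
from collections import Counter

def agglomerate_window(clusters: list, branch: int = 3) -> list:
    """Merge the closest pair (later event into the earlier one) until
    len(clusters) / branch clusters remain in this window."""
    target = max(1, len(clusters) // branch)
    while len(clusters) > target:
        best = None                      # (similarity, earlier idx, later idx)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = Counter(clusters[i]) + Counter(clusters[j])  # summed centroid
        del clusters[j]                  # the later event disappears
    return clusters
```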

Page 29: Topics Detection and Tracking

Hierarchical Topic Detection: Model Description (8/8)

• Then all clusters in the same language but from different sources are combined
• Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root
• Machine translation was used for the Arabic and Mandarin stories to simplify the similarity calculation

Page 30: Topics Detection and Tracking

Hierarchical Topic Detection: Training (1/4)

• Training corpus: TDT4 (newswire and broadcast stories); testing corpus: TDT5 (newswire only)
• Training takes the newswire stories from the TDT4 corpus, including NYT, APW, ANN, ALH, AFP, ZBN, and XIN (420,000 stories)

TDT-4 Corpus Overview

Language | Source | Data type | # documents | Total audio (hours) | Reference transcripts
Arabic | AFA | Newswire | 19126 | --- | ---
Arabic | ALH | Newswire | 10656 | --- | ---
Arabic | ANN | Newswire | 9682 | --- | ---
Arabic | VAR | Radio | 2378 | 68 | commercial
Arabic | NTV (Web) | Television | 871 | 20.5 | commercial
English | APW | Newswire | 10268 | --- | ---
English | NYT | Newswire | 4842 | --- | ---
English | VOA | Radio | 2694 | 70 | commercial + spell check
English | PRI | Radio | 1965 | 62 | commercial + spell check
English | CNN | Television | 4698 | 64.5 | closed-caption
English | ABC | Television | 1692 | 38.5 | closed-caption
English | NBC | Television | 1234 | 35 | closed-caption
English | MNB | Television | 997 | 43 | closed-caption
Mandarin | XIN | Newswire | 9837 | --- | ---
Mandarin | ZBN | Newswire | 8114 | --- | ---
Mandarin | VOM | Radio | 1780 | 64 | commercial
Mandarin | CNR (Web) | Radio | 2259 | 43 | commercial
Mandarin | CTV | Television | 1483 | 32.5 | commercial
Mandarin | CTS (Web) | Television | 2221 | 44 | commercial
Mandarin | CBS (Web) | Television | 1451 | 34 | commercial

ASR / machine translation providers per language, as listed in the original table: Arabic: BBN / IBM TJ Watson Research Center; English: LIMSI / N/A; Mandarin: BBN / Systran (run at LDC)

Page 31: Topics Detection and Tracking

Hierarchical Topic Detection: Training (2/4)

TDT-5 Corpus Content

Language | Source | Doc count
Arabic | AFA (Agence France Presse) | 30,593
Arabic | ANN (An-Nahar) | 8162
Arabic | UMM (Ummah) | 1104
Arabic | XIA (Xinhua) | 33,051
Arabic | Total | 72,910
English | AFE (Agence France Presse) | 95,432
English | APE (Associated Press) | 104,941
English | CNE (CNN) | 1117
English | LAT (LA Times/Washington Post) | 6692
English | NYT (New York Times) | 12,024
English | UME (Ummah) | 1101
English | XIE (Xinhua) | 56,802
English | Total | 278,109
Mandarin | AFC (Agence France Presse) | 5655
Mandarin | CNA (China News Agency) | 4569
Mandarin | XIN (Xinhua) | 37,251
Mandarin | ZBN (Zaobao News) | 9011
Mandarin | Total | 56,486
Corpus | Total | 407,505

Page 32: Topics Detection and Tracking

Hierarchical Topic Detection: Training (3/4)

• Parameters
  – BRANCH: the average branching factor in the bounded agglomerative clustering algorithm
  – Threshold: used in event formation to decide whether a new event will be created
  – STOP: within each source, clustering stops once the number of clusters is smaller than the square root of the number of stories
  – WSIZE: the maximum window size in agglomerative clustering
  – NSTORY: each story is compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from the time locality in event threading

Page 33: Topics Detection and Tracking

Hierarchical Topic Detection: Training (4/4)

• Among the clusters very close to the root node, some contain thousands of stories
• Both the 1-NN and agglomerative clustering algorithms favor large clusters
• Modified the similarity calculation to give smaller clusters more of a chance (see the sketch below)
• sim(v1, v2) is the similarity of the cluster centroids
• |cluster1| is the number of stories in the first cluster
• a is a constant that controls how much of an advantage smaller clusters get

$\text{sim}(cluster_1, cluster_2) = \frac{\text{sim}(v_1, v_2)}{|cluster_1|^{a}}$
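As a one-liner over the earlier `cosine` helper (the default value of `a` here is purely illustrative; the deck's own runs use a = 0, which recovers the plain centroid score):

```python
def size_penalized_sim(v1: dict, v2: dict, cluster1_size: int,
                       a: float = 0.5) -> float:
    """sim(cluster1, cluster2) = sim(v1, v2) / |cluster1|**a."""
    return cosine(v1, v2) / (cluster1_size ** a)
```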

Page 34: Topics Detection and Tracking

Hierarchical Topic Detection: Results (1/2)

• Three runs for each condition: UMASSv1, UMASSv12 and UMASSv19

Parameter | Description | UMASSv1 | UMASSv12 | UMASSv19
KNN bound | Number of previous stories compared | 100 | 100 | 100
SIM | Similarity function | Cluster centroid | Cluster centroid, normalized, a = 0 | The same
WEIGHT | Vector weight scheme | IDF (below) | The same | The same
THRESH | Threshold for KNN | 0.3 | 0.3 | 0.3
WSIZE | Maximum number of clusters in agglomerative clustering | 120 | 120 | 240
BRANCH | Average branching factor | 3 | 3 | 3
STOP | Decides when clusters from different sources are mixed | 5 | 5 | 5

$IDF = \frac{\log\left(\frac{N + 0.5}{DF}\right)}{\log(N + 1)}$
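This is the familiar INQUERY-style IDF; a direct transcription (variable names are ours):

```python
import math

def idf(df: int, n: int) -> float:
    """IDF = log((N + 0.5) / DF) / log(N + 1), as in the table above."""
    return math.log((n + 0.5) / df) / math.log(n + 1)
```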

Page 35: Topics Detection and Tracking

Hierarchical Topic Detection: Results (2/2)

• A small branching factor can reduce both the detection cost and the travel cost
• With a small branching factor, there are more clusters with different granularities
• The assumption of temporal locality is useful in event threading; additional experiments after the submission show that a larger window size can improve performance

Page 36: Topics Detection and Tracking

Conclusion

• Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them
• Showed the extent to which IR technology can be used to solve TDT problems