Page 1: BCS_Seminar.ppt

Lexical Chains for Topic Detection and Tracking

British Classification Society

Feb 23rd 2001

Joe Carthy & Nicola Stokes

University College Dublin

http://www.cs.ucd.ie/staff/jcarthy

Tel. +353 1 706 2481 or 706 2469

Fax. +353 1 269 7262

Page 2: BCS_Seminar.ppt

Topic Detection and Tracking

• Topic Detection and Tracking (TDT)

– DARPA-funded TDT project with UMass, CMU and Dragon Systems

– Domain is all broadcast news: written and spoken

• TDT includes:

– First Story Detection

– Event Tracking

– Segmentation

• Applications

– digital news editors

– media analysts

– equity traders

Page 3: BCS_Seminar.ppt

Topic Tracking and Detection

• Tracking may be defined as

– Take a corpus of news stories

– Given 1 (or 2, 4, 8, 16) sample stories about an event

– Find all subsequent stories in the corpus about that event

• Detection: Is this a new story?

Page 4: BCS_Seminar.ppt

Topic Tracking and Detection

• An event is defined by a list of stories that discuss it, e.g. “Kobe earthquake” is defined by the first story that describes this event.

Page 5: BCS_Seminar.ppt

UCD TDT Architecture

[Diagram labels: Server; Lexical Chainer; Event Tracker; Event Detector]

Page 6: BCS_Seminar.ppt

Topic Detection and Tracking

[Diagram labels: Data Stream; story with DATE: 02:36 and TITLE: “O.J. Simpson Bought Knife, Murder Hearing Told”; Previous Stories: O.J. SIMPSON MURDER TRIAL, NYC SUBWAY BOMBINGS, CARLOS THE JACKAL]

Page 7: BCS_Seminar.ppt

Benchmark Systems

• Implemented benchmark systems using conventional IR techniques:

– Stemmed keywords (Porter)

– Stopword removal

– Term weighting (Robertson, Sparck Jones)
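
The keyword benchmark steps can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the slides: the stopword list is a tiny illustrative one, `naive_stem` is a crude placeholder for the Porter stemmer, and the weight is an IDF-style collection frequency weight (log N - log n) of the kind described by Robertson and Sparck Jones. The slides do not say which weighting formula the benchmark systems actually used.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "was", "is"}


def naive_stem(word: str) -> str:
    """Crude suffix stripper used only as a placeholder for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> List[str]:
    """Lowercase, tokenise, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def collection_frequency_weights(docs: List[List[str]]) -> Dict[str, float]:
    """Weight each term by log N - log n: N documents, n of them containing the term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs) - math.log(n) for term, n in doc_freq.items()}


# Hypothetical three-story corpus, invented for illustration.
stories = [
    "The earthquake in Kobe damaged the bridges",
    "Bridges and roads in Kobe reopened",
    "Traders watched the markets",
]
indexed = [index_terms(s) for s in stories]
weights = collection_frequency_weights(indexed)
```

In this toy corpus a term that appears in two of the three stories ("kobe") receives a lower weight than a term that appears in only one ("earthquake"), which is the behaviour a tracking query relies on.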

Page 8: BCS_Seminar.ppt

Lexical Chaining

– Lexical chains - textual cohesion (Halliday & Hasan)

– Cohesion: text makes sense as a whole

– Cohesion occurs where the interpretation of one item is dependent on that of another item in the text. It is this dependency that gives rise to cohesion.

Page 9: BCS_Seminar.ppt

Lexical Chaining

– Where the cohesive elements occur over a number of sentences a cohesive chain is formed.

– For example, the sentences:

John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.

– give rise to the lexical chain: {mud pie, dessert, mud pie, chocolate, it}

– Lexical cohesion is, as the name suggests, lexical: it involves the selection of a lexical item that is in some way related to one occurring previously.

Page 10: BCS_Seminar.ppt

Lexical Chaining

– Reiteration is a form of lexical cohesion which involves the repetition of a lexical item.

This may involve simple repetition of the word but also includes the use of a synonym, near-synonym or superordinate.

For example, in the sentences “John bought a Jag. He loves the car.” the superordinate car refers back to its subordinate Jag.

The part-whole relationship is also an example of lexical cohesion, e.g. airplane and wing.

– A lexical chain is a sequence of related words in the text, spanning short or long distances.

Page 11: BCS_Seminar.ppt

Lexical Chaining

– A chain is independent of the grammatical structure of the text and in effect it is a list of words that captures a portion of the cohesive structure of the text.

– A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept the term represents, i.e. word sense disambiguation.

– Morris and Hirst were the first researchers to suggest the use of lexical chains to determine the structure of texts.

Page 12: BCS_Seminar.ppt

Lexical Chaining

– By identifying the lexical chains in a news story we hope to identify the focus of a news story. This can then be used in tracking and detection.

– It is important to realise that determining lexical chains is not a sophisticated natural language analysis process.

– Other Applications of Lexical Chaining

• Hypertext links: Green

• Summarisation: Barzilay

• Segmentation: Okumura and Honda

• IR: Stairmand, Ellman, Mochizuki

• Malapropism detection: St-Onge

• Multimedia indexing: Kazman, Al-Halimi

Page 13: BCS_Seminar.ppt

Chain Generation

– In order to construct lexical chains we must be able to identify relationships between terms.

– This is made possible by the use of WordNet.

– WordNet is a computational lexicon which was developed at Princeton University.

– In WordNet, each concept is represented by a synonym set (synset): the set of all those terms that may be used to refer to that concept.

Page 14: BCS_Seminar.ppt

Chain Generation

– For example, the concept airplane is represented by the synset {airplane, aeroplane, plane}.

– A WordNet synset has a numerical identifier such as 02054514.

– Links between synsets in WordNet represent conceptual relations such as hyponymy (is-a) and meronymy (part-of); synonymy is captured within a synset itself.

– The synset identifier can be used to represent the concept referred to in the synset, for indexing and lexical chaining purposes.
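
As a rough illustration of indexing by synset identifier, the sketch below uses a hand-built toy lexicon in place of WordNet. Only the airplane synset and its identifier 02054514 come from the slide; the wing entry, its identifier, the part-of link and all function names are hypothetical. The sketch also assumes one sense per term, which real WordNet lookups cannot assume.

```python
from typing import Dict, List, Optional

# Toy lexicon. Only the airplane entry and its identifier come from the slide;
# the wing entry and its identifier are made up for illustration.
SYNSETS: Dict[str, List[str]] = {
    "02054514": ["airplane", "aeroplane", "plane"],
    "99999999": ["wing"],  # hypothetical identifier
}

# Hypothetical part-of (meronymy) link: part synset -> whole synset.
PART_OF: Dict[str, str] = {"99999999": "02054514"}

# Inverted lookup from a term to the synset identifier it belongs to.
TERM_TO_SYNSET: Dict[str, str] = {
    term: synset_id for synset_id, terms in SYNSETS.items() for term in terms
}


def concept_id(term: str) -> Optional[str]:
    """Return the synset identifier for a term, or None if it is not in the lexicon."""
    return TERM_TO_SYNSET.get(term.lower())


def index_by_concept(terms: List[str]) -> List[str]:
    """Replace each known term by its synset identifier; unknown terms are dropped."""
    ids = (concept_id(t) for t in terms)
    return [i for i in ids if i is not None]
```

With this representation "plane" and "aeroplane" index to the same concept, so two stories that use different words for the same thing can still match.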

Page 15: BCS_Seminar.ppt

Word Sense Disambiguation

[Diagram: WordNet senses for the 1st term, EXHAUST, and for Term i, CAR, linked by “Part of” and “Has a” relations. Node labels: Exhaust 32748; Car_exhaust 32748; Tire_out, Fatigue 374222; Automobile 057643; Railway carriage 324932; Train 3984]

Page 16: BCS_Seminar.ppt

Chain Generation

• Chaining procedure for a story:

– Take the ith term in the story and generate the set Neighbour_i of its related synsets

– For each other term, if it is a member of the set Neighbour_i then add it to the lexical chain for term_i.

– If the lexical chain contains 3 or more elements then store the chain in a chain index file

– Repeat above for all terms in the story.
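
The chaining procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy lexicon, its synset names and its links are all hypothetical stand-ins for WordNet, and chains are returned in a dictionary rather than written to a chain index file.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon standing in for WordNet: term -> synset identifiers.
TERM_SYNSETS: Dict[str, Set[str]] = {
    "car": {"s_car"},
    "automobile": {"s_car"},
    "exhaust": {"s_exhaust"},
    "wheel": {"s_wheel"},
    "airplane": {"s_plane"},
    "wing": {"s_wing"},
    "election": {"s_election"},
}

# Hypothetical related-synset links (e.g. part-of), stored symmetrically.
LINKS = [("s_car", "s_exhaust"), ("s_car", "s_wheel"), ("s_plane", "s_wing")]
RELATED: Dict[str, Set[str]] = {}
for left, right in LINKS:
    RELATED.setdefault(left, set()).add(right)
    RELATED.setdefault(right, set()).add(left)


def neighbours(term: str) -> Set[str]:
    """Neighbour_i: the term's own synsets plus every synset linked to them."""
    own = TERM_SYNSETS.get(term, set())
    result = set(own)
    for synset in own:
        result |= RELATED.get(synset, set())
    return result


def build_chains(terms: List[str], min_length: int = 3) -> Dict[int, List[str]]:
    """Return {i: chain} for every term position i whose chain has >= min_length terms."""
    chains: Dict[int, List[str]] = {}
    for i, term in enumerate(terms):
        neighbour_i = neighbours(term)
        chain = [term]
        for j, other in enumerate(terms):
            # A term joins the chain if one of its synsets is in Neighbour_i.
            if j != i and TERM_SYNSETS.get(other, set()) & neighbour_i:
                chain.append(other)
        if len(chain) >= min_length:
            chains[i] = chain
    return chains


story = ["car", "exhaust", "election", "automobile", "wheel", "airplane", "wing"]
chains = build_chains(story)
```

Here the chain started by "car" collects "exhaust", "automobile" and "wheel", while "airplane" and "wing" form only a two-element chain and are discarded by the length threshold of 3.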

Page 17: BCS_Seminar.ppt

– Computing Chain_Sim(Trackset_i, Story_j)

• The Overlap Coefficient may be defined as follows, for two lexical chains $c_1$ and $c_2$:

\[ \text{Overlap Coefficient} = \frac{|c_1 \cap c_2|}{\min(|c_1|, |c_2|)} \]
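
A small Python sketch of the overlap coefficient follows. It treats each chain as a set of distinct terms and returns 0.0 when either chain is empty; both choices are assumptions of this sketch, since the slides do not say how repeated terms or empty chains are handled.

```python
from typing import Iterable


def overlap_coefficient(chain1: Iterable[str], chain2: Iterable[str]) -> float:
    """|c1 ∩ c2| / min(|c1|, |c2|), with chains reduced to sets of distinct terms."""
    c1, c2 = set(chain1), set(chain2)
    if not c1 or not c2:
        return 0.0  # assumption: an empty chain overlaps with nothing
    return len(c1 & c2) / min(len(c1), len(c2))


# Two shared terms out of a shorter chain of three gives 2/3.
example = overlap_coefficient(
    ["car", "exhaust", "wheel"],
    ["car", "wheel", "road", "driver"],
)
```

Dividing by the size of the smaller chain means a short chain that is fully contained in a longer one scores 1.0.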

Page 18: BCS_Seminar.ppt

Evaluation Metrics

– System returns a set S of documents; S' is the set of documents not returned:

• a = # in S discussing new events

• b = # in S not discussing new events

• c = # in S' discussing new events

• d = # in S' not discussing new events

– Recall = a / (a+c)

– Precision = a / (a+b)

– Miss Rate = c / (a+c) = 1 - Recall

– False Alarm Rate = b / (b+d) = Fallout
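
The four metrics can be computed directly from the counts a, b, c and d. The sketch below is illustrative: the function names and the example counts are invented, and returning 0.0 for an empty denominator is an assumption rather than something the slides specify.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 for an empty denominator (an assumption of this sketch)."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Compute the four metrics from the contingency counts a, b, c and d."""
    return {
        "recall": _ratio(a, a + c),
        "precision": _ratio(a, a + b),
        "miss_rate": _ratio(c, a + c),
        "false_alarm_rate": _ratio(b, b + d),
    }


# Hypothetical counts: 10 returned documents, 90 not returned.
scores = detection_metrics(a=8, b=2, c=2, d=88)
```

With these counts recall and precision are both 0.8, the miss rate is 0.2 and the false alarm rate is 2/90, so recall and miss rate sum to 1 as the definitions require.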

Page 19: BCS_Seminar.ppt

Tracking Results

[Chart: Average Recall vs Threshold. x-axis: Threshold, 0.25 to 0.6; y-axis scale: 0 to 100. Series: LexTrack-O (Nt = 1), KeyTrack (Nt = 1), LexTrack (Nt = 1)]

Page 20: BCS_Seminar.ppt

Tracking Results

[Chart: % Miss Rate vs Threshold. x-axis: Threshold, 0.25 to 0.6; y-axis scale: 0 to 100. Series: LexTrack-O (Nt = 1), KeyTrack (Nt = 1), LexTrack (Nt = 1)]

Page 21: BCS_Seminar.ppt

Detection Results

[Chart: Detection Performance. Axes: % False Alarms and % Misses, each 0 to 100. Series: Lex_Detect, TRAD, CHAINS_ONLY]

Page 22: BCS_Seminar.ppt

Analysis of results

– Expected trade-off between precision and recall

– A small number of stories is sufficient to construct a tracking query

– Performance in line with other TDT researchers

– Lexical chains: improvement not significant?

Page 23: BCS_Seminar.ppt

TDT and Lexical Chain References

• Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y., “Topic Detection and Tracking Pilot Study: Final Report”, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Morgan Kaufmann, San Francisco, 1998.

• Allan, J., Papka, R., and Lavrenko, V., “Online New Event Detection and Tracking”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998.

• Barzilay, R., “Lexical Chains for Summarization”, M.Sc. Thesis, Ben-Gurion University of the Negev, Israel, November 1997.

• Barzilay, R., and Elhadad, M., “Using Lexical Chains for Text Summarization”, The Fifth Bar-Ilan Symposium on Foundations of Artificial Intelligence Focusing on Intelligent Agents, Bar-Ilan University, Ramat Gan, Israel, June 1997.

• Budanitsky, A., “Lexical Semantic Relatedness and its Application in Natural Language Processing”, (PhD thesis) Technical Report CSRG-390, University of Toronto, 1999.

• Ellman, J., “Using Roget's Thesaurus to Determine the Similarity of Texts”, PhD Thesis, University of Sunderland, 2000.

• Fellbaum, C., (Ed.), WordNet: An Electronic Lexical Database and Some of its Applications, MIT Press, 1998.

• Green, S.J., “Automatically Generating Hypertext by Computing Semantic Similarity”, Ph.D. Thesis, University of Toronto, 1997.

Page 24: BCS_Seminar.ppt

• Halliday, M.A.K. and Hasan, R., “Cohesion in English”, Longman, 1976.

• Hatch, P., "Lexical Chaining for the Online Detection of New Events", M.Sc. Thesis, University College Dublin, 2000.

• Hirst, G., and St-Onge, D., “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in WordNet: An Electronic Lexical Database and Some of its Applications, Fellbaum, C., (Ed.), MIT Press, 1998.

• Kazman, R., Al-Halimi, R., Hunt, W., and Mantei, M., “Four Paradigms for Indexing Video Conferences”, IEEE MultiMedia, 3 (1), Spring 1996.

• Mochizuki, H., Iwayama, M., and Okumura, M., “Passage Level Document Retrieval Using Lexical Chains”, RIAO 2000, Content Based Multimedia Information Access, 491-506, 2000.

• Morris J., and Hirst, G., “Lexical Cohesion, the Thesaurus, and the Structure of Text”, Computational Linguistics, 17 (1), 211-232, 1991.

• Okumura, M., and Honda, T., “Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion”, in Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94), Vol. 2, 755-761, Kyoto, Japan, August 1994.

• Porter, M.F., “An Algorithm for Suffix Stripping”, Program, 14, 130-137, 1980.

• Robertson, S.E. and Sparck Jones, K., “Simple, Proven Approaches to Text Retrieval”, University of Cambridge Computing Laboratory Technical Report Number 356, May 1997.

• Stairmand, M.A., “A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval”, Ph.D. Thesis, UMIST, 1996.

• Stokes, N., and Carthy, J., “First Story Detection using a Composite Document Representation”, HLT 2001, Human Language Technology Conference, San Diego, California, March 18-21, 2001.

• TDT2000, “The Year 2000 Topic Detection and Tracking (TDT2000) Task Definition and Evaluation Plan”, available at the following URL: http://morph.ldc.upenn.edu/TDT/Guide/manual.front.html, November 2000.