
Discovering Evolutionary Theme Patterns from Text

- An Exploration of Temporal Text Mining

Qiaozhu Mei, ChengXiang Zhai

Department of Computer Science, University of Illinois at Urbana-Champaign

U.S.A


Motivation

• Most text collections bear time stamps
  – News articles, scientific literature, emails, etc.

• Many useful temporal patterns exist
  – Emerging topics/themes
  – Decaying topics/themes
  – Topic evolution thread
  – Topic/theme life cycles
  – …

• How do we discover and exploit such patterns?


Theme Evolution Graph (Asia Tsunami)

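[Figure: theme evolution graph for the Asia Tsunami news, laid out along a time axis over documents Doc1, Doc3, …. Vertices are theme spans: Immediate Reports; Statistics of Death and Loss; Personal Experience of Survivors; Statistics of Further Impact; Aid from Local Areas; Aid from the World; Donations from Countries; Specific Events of Aid; Lessons from Tsunami; Research Inspired. Edges are evolutionary transitions, and a path through the graph is a theme evolution thread.]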

• Useful for summarizing the news…


Theme Life Cycle (SIGIR Proceedings)

• Useful for revealing historical trends and hot topics…

[Figure: theme strength (vertical axis) over time (horizontal axis; 1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization.]


Problem Definition

• Evolutionary Theme Pattern (ETP)
  – Theme Evolution Graph
    • A weighted directed graph in which each vertex is a theme span and each edge is an evolutionary transition
  – Theme Life Cycle
    • The strength of a theme over the whole time line

• Given a text collection with time stamps, the problem of discovering ETP is to
  – Extract a theme evolution graph
  – Model the life cycles of the most salient themes
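The graph definition above can be sketched as a small data structure. This is only an illustration of "weighted directed graph over theme spans"; the class names, the integer time units, and the forward-in-time check are assumptions of this sketch, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ThemeSpan:
    """A theme restricted to one time interval (a graph vertex)."""
    name: str   # hypothetical label, e.g. "aid"
    start: int  # interval start, in any comparable time unit
    end: int    # interval end


@dataclass
class ThemeEvolutionGraph:
    """Weighted directed graph: vertices are theme spans, edges are transitions."""
    edges: dict = field(default_factory=dict)  # (src, dst) -> weight

    def add_transition(self, src: ThemeSpan, dst: ThemeSpan, weight: float) -> None:
        # Assumption: a transition only points forward in time.
        if src.end > dst.start:
            raise ValueError("source span must end before destination span starts")
        self.edges[(src, dst)] = weight

    def successors(self, span: ThemeSpan):
        """All (destination span, weight) pairs reachable in one transition."""
        return [(dst, w) for (src, dst), w in self.edges.items() if src == span]
```

Following `successors` repeatedly from an early span traces one candidate theme evolution thread.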


Research Questions

• How to represent a theme?

• How to extract themes from a collection automatically?

• How to model the transitions of themes?

• How to segment the collection with themes?

• How to model and compute the strength of each theme at a given time period?


Our Approach

[Figure: pipeline overview. A collection with time stamps is partitioned into time intervals t1, t2, t3, …, t. Task I (Theme Extraction): theme spans are extracted from each interval, and global salient themes θ1, θ2, θ3 plus a background model B are extracted from the whole collection. Task II (Transition Modeling): theme transitions between theme spans are modeled, giving the theme evolution graph. Task III (Theme Segmentation): theme shifts are modeled and the collection is decoded; computing theme strength over time t then gives the theme life cycles.]


Our Approach (Cont.)

• Extracting Theme Evolution Graph
  – Partition collection into time intervals
  – Extract themes from each time span (task I)
  – Model transitions between theme spans (task II)

• Modeling theme life cycles
  – Extract most salient themes from the whole collection (task I)
  – Segment the collection with themes (task III)
  – Compute the strength of each theme over time
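The first step, partitioning the collection into time intervals, might look like the sketch below. Fixed-length, non-overlapping intervals and the function name `partition_by_interval` are assumptions made here for illustration; the slides do not say how the intervals are chosen.

```python
from collections import defaultdict
from datetime import date


def partition_by_interval(docs, start, days_per_interval):
    """Group (timestamp, text) pairs into consecutive fixed-length intervals.

    Returns a dict mapping interval index -> list of texts. Documents dated
    before `start` receive negative indices.
    """
    buckets = defaultdict(list)
    for stamp, text in docs:
        index = (stamp - start).days // days_per_interval
        buckets[index].append(text)
    return dict(buckets)


# Hypothetical toy collection, not taken from the data sets in the slides.
docs = [
    (date(2004, 12, 26), "quake reported"),
    (date(2004, 12, 30), "death toll rises"),
    (date(2005, 1, 4), "aid pledged"),
]
intervals = partition_by_interval(docs, date(2004, 12, 24), 5)
```

Theme extraction (task I) would then run once per bucket to produce theme spans.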


Task I: Theme Extraction

• There are k themes in the collection (or a time span); each document is a sample of words generated by multiple themes

• Infer the best theme language models that fit our data

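[Figure: "generating" word w in document d in the collection. The word is drawn from the background model θ_B with probability λ_B; otherwise, with probability 1 − λ_B, it is drawn from theme θ_j, chosen with document-specific weight π_{d,j} (j = 1, …, k). Example word distributions: Theme 1 (warning 0.3, system 0.2, …); Theme 2 (aid 0.1, donation 0.05, support 0.02, …); Theme k (statistics 0.2, loss 0.1, dead 0.05, …); Background B (is 0.05, the 0.04, a 0.03, …).]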

Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood

$$p(w : d) = \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \left[ \pi_{d,j} \, p(w \mid \theta_j) \right]$$
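A maximum-likelihood fit of a mixture of this shape is commonly done with EM. The following is a minimal pure-Python sketch under stated assumptions: the background model is taken to be the collection-wide word frequencies, initialization is random, and the name `fit_themes` and its parameters are hypothetical. It is not claimed to be the authors' exact procedure.

```python
import random
from collections import Counter


def fit_themes(docs, k, lambda_b=0.9, iters=50, seed=0):
    """EM for a k-theme mixture with a fixed background model.

    docs: list of token lists. Returns (themes, mix), where themes[j][w]
    approximates p(w | theta_j) and mix[d][j] approximates pi_{d,j}.
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set(w for c in counts for w in c))
    total = sum(sum(c.values()) for c in counts)
    # Assumption: background = collection-wide relative word frequencies.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}

    def normalize(weights):
        s = sum(weights.values())
        return {key: v / s for key, v in weights.items()}

    # Random theme word distributions; uniform mixing weights per document.
    themes = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    mix = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_themes = [{w: 1e-12 for w in vocab} for _ in range(k)]
        for d, c in enumerate(counts):
            new_mix = [1e-12] * k
            for w, n in c.items():
                # E-step: which component generated word w in document d?
                theme_part = [mix[d][j] * themes[j][w] for j in range(k)]
                theme_sum = sum(theme_part)
                bg = lambda_b * background[w]
                p_bg = bg / (bg + (1 - lambda_b) * theme_sum)
                for j in range(k):
                    # Expected count of w in d attributed to theme j.
                    r = n * (1 - p_bg) * theme_part[j] / theme_sum
                    new_mix[j] += r
                    new_themes[j][w] += r
            # M-step: mixing weights of this document.
            s = sum(new_mix)
            mix[d] = [v / s for v in new_mix]
        # M-step: theme word distributions.
        themes = [normalize(t) for t in new_themes]
    return themes, mix
```

Because λ_B is fixed rather than estimated, a larger value pushes more common words into the background and leaves the themes more discriminative.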


Task II: Transition Modeling

• Theme spans in an earlier time interval could evolve into theme spans in a later time interval

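[Figure: time axis T from t1 to t2. Theme span A in the earlier interval and candidate theme spans B? and C? in the later interval, each shown as a word distribution (microarray 0.2, gene 0.1, protein 0.05; web 0.3, classification 0.1, topic 0.1; information 0.2, topic 0.1, classification 0.1, text 0.05). An edge labeled "Evolutionary Transition" is annotated with the theme similarity.]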

• Similarity/distance between two theme spans is modeled with KL Divergence between two distributions
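As a rough illustration of comparing two theme spans, the sketch below computes the standard KL divergence D(p || q) = Σ_w p(w) log(p(w) / q(w)) between two word distributions. The additive smoothing, the renormalization, and the direction of the divergence are assumptions of this sketch; the slide text does not state them.

```python
import math


def kl_divergence(p, q, epsilon=1e-6):
    """D(p || q) over the union vocabulary of two word -> probability dicts.

    Assumption: missing words receive a small epsilon and both
    distributions are renormalized so the logarithm is always defined.
    """
    vocab = set(p) | set(q)

    def smooth(dist):
        raw = {w: dist.get(w, 0.0) + epsilon for w in vocab}
        total = sum(raw.values())
        return {w: v / total for w, v in raw.items()}

    ps, qs = smooth(p), smooth(q)
    return sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)


# Truncated word distributions as they appear on the slide.
info = {"information": 0.2, "topic": 0.1, "classification": 0.1, "text": 0.05}
web = {"web": 0.3, "classification": 0.1, "topic": 0.1}
bio = {"microarray": 0.2, "gene": 0.1, "protein": 0.05}
```

A smaller divergence means the later span is closer to the earlier one, so `kl_divergence(web, info)` comes out lower than `kl_divergence(bio, info)`; where to set the cutoff for drawing an edge is a separate choice.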


Task III: Theme Segmentation

• View the whole collection as a sequence ordered by time, and model the theme shifts in documents with a Hidden Markov Model

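[Figure: a Hidden Markov Model whose states are the themes θ1, θ2, θ3 (Theme 1, Theme 2, Theme 3) and the background B. Each state emits words with output probability P(w|θ); the transition probabilities are trained on the collection, viewed as one long word sequence w w w …; decoding the collection labels each word with a state.]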
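Decoding with an HMM is usually done with the Viterbi algorithm. The sketch below is a generic log-space Viterbi decoder with themes as states; the probability floor for unseen words, the function names, and the toy parameters are all assumptions. The `theme_strength` helper counts decoded words per theme, which is one plausible reading of "absolute strength" and not a definition taken from the slides.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-9):
    """Most likely state (theme) sequence for a word sequence, in log space."""
    def log(x):
        return math.log(max(x, floor))  # floor avoids log(0) for unseen words

    # best[s] = best log-probability of any path ending in state s.
    best = {s: log(start_p[s]) + log(emit_p[s].get(words[0], 0.0)) for s in states}
    back = []
    for w in words[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            origin = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            best[s] = prev[origin] + log(trans_p[origin][s]) + log(emit_p[s].get(w, 0.0))
            pointers[s] = origin
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def theme_strength(path, theme):
    """Assumed absolute strength: number of words decoded as `theme`."""
    return sum(1 for s in path if s == theme)


# Hypothetical two-theme model with "sticky" transitions.
states = ["aid", "warning"]
start_p = {"aid": 0.5, "warning": 0.5}
trans_p = {"aid": {"aid": 0.8, "warning": 0.2}, "warning": {"aid": 0.2, "warning": 0.8}}
emit_p = {"aid": {"donation": 0.5, "relief": 0.5}, "warning": {"system": 0.5, "alert": 0.5}}
path = viterbi(["donation", "relief", "system", "alert"], states, start_p, trans_p, emit_p)
```

Running `theme_strength` over the words of each time window, rather than over the whole path, would give one point per window on a life-cycle curve.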


Our Approach: Revisit

[Figure: the pipeline overview from the "Our Approach" slide, repeated: partitioning the collection with time stamps into intervals t1, t2, t3, …, t; theme spans extraction; modeling theme transitions to obtain the theme evolution graph; extracting global salient themes θ1, θ2, θ3 and background B; modeling theme shifts; decoding the collection; computing theme strength to obtain the theme life cycles.]


Experiments

• Two data sets:
  – Asia Tsunami: 7468 news articles spanning 50 days from 10 news sources
  – KDD Abstracts: 496 abstracts from 6 years’ KDD conference proceedings

• On each data set, we extract a theme evolution graph and model the life cycles of global salient themes


Theme Evolution Graph: Tsunami

[Figure: theme evolution graph along a time axis T marked 12/28/04, 01/05/05, 01/15/05, …. Theme spans shown, each as its top words with probabilities:]

• aid 0.020, relief 0.016, U.S. 0.013, military 0.011, U.N. 0.011, …
• Bush 0.016, U.S. 0.015, $ 0.009, relief 0.008, million 0.008, …
• Indonesian 0.01, military 0.01, islands 0.008, foreign 0.008, aid 0.007, …
• system 0.0104, Bush 0.008, warning 0.007, conference 0.005, US 0.005, …
• system 0.008, China 0.007, warning 0.005, Chinese 0.005, …
• warning 0.012, system 0.012, Islands 0.009, Japan 0.005, quake 0.003, …


Theme Life Cycles: Tsunami

[Figure: CNN, absolute strength over time for the themes Aid from the world, Research, Aid for children, Statistics, and Personal experiences. Top words of two themes as shown on the slide:]

• $ 0.0173, million 0.0135, relief 0.0134, aid 0.0099, U.N. 0.0066, …
• I 0.0322, wave 0.0061, beach 0.0051, saw 0.0046, sea 0.0046, …


Theme Life Cycles: Tsunami

[Figure: XINHUA News, absolute strength over time for the themes Aid from the world, Research, Aid from China, Statistics, and Scene and Experiences. Top words of two themes as shown on the slide:]

• dollars 0.0226, million 0.0204, aid 0.0118, U.N. 0.0102, reconstruction 0.0062, …
• China 0.0391, yuan 0.0180, Beijing 0.0089, $ 0.0058, donation 0.0052


Theme Life Cycles: Tsunami

[Figure: XINHUA News, normalized strength over time for the themes Aid from the world, Research, Aid from China, Statistics, and Scene and Experiences. Top words of two themes as shown on the slide:]

• $ 0.0173, million 0.0135, relief 0.0134, aid 0.0099, U.N. 0.0066, …
• China 0.0391, yuan 0.0180, Beijing 0.0089, $ 0.0058, donation 0.0052


Theme Evolution Graph: KDD

[Figure: theme evolution graph along a time axis T marked 1999, 2000, 2001, 2002, 2003, 2004. Theme spans shown, each as its top words with probabilities:]

• SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
• decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
• classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
• information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
• web 0.009, classification 0.007, features 0.006, topic 0.005, …
• mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
• topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …


Theme Life Cycles: KDD

[Figure: global theme life cycles of KDD abstracts. Normalized strength of theme (vertical axis, 0 to 0.02) over time in years (horizontal axis, 1999 to 2004) for the themes Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business. Top words of three themes as shown on the slide:]

• gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
• marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
• rules 0.0142, association 0.0064, support 0.0053, …


Summary and Future Work

• We defined a new problem of temporal text mining: discovering evolutionary theme patterns

• We proposed an algorithm to extract a theme evolution graph and model theme life cycles from a text collection

• Experiments on two data sets show that this algorithm is effective at discovering interesting ETPs

• Future Work:
  – Define a formal evaluation measure and evaluate possible approaches
  – Further improve the model, integrate the two parts together, and adopt prior knowledge
  – Extend this model to compare theme evolutions in multiple collections (e.g. KDD proceedings and SIGMOD proceedings)


Thanks!