1
Discovering Evolutionary Theme Patterns from Text
- An Exploration of Temporal Text Mining
Qiaozhu Mei, ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
U.S.A
2
Motivation
• Most text collections bear time stamps
– News articles, scientific literature, emails, etc.
• Many useful temporal patterns exist
– Emerging topics/themes
– Decaying topics/themes
– Topic evolution thread
– Topic/theme life cycles
– …
• How do we discover and exploit such patterns?
3
Theme Evolution Graph (Asia Tsunami)
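[Figure: theme evolution graph for the Asia Tsunami news, laid out along a time axis. Documents (Doc1, Doc3, Doc ..) are ordered by time. Each vertex is a theme span, each edge is an evolutionary transition, and a path through the graph is a theme evolution thread. Theme spans shown: Immediate Reports; Statistics of Death and Loss; Personal Experience of Survivors; Statistics of Further Impact; Aid from Local Areas; Aid from the World; Donations from Countries; Specific Events of Aid; Lessons from Tsunami; Research Inspired; …]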
• Useful for summarizing the news…
4
Theme Life Cycle (SIGIR Proceedings)
• Useful for revealing historical trends and hot topics…
[Figure: theme strength (vertical axis) against time (horizontal axis; ticks at 1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization.]
5
Problem Definition
• Evolutionary Theme Pattern (ETP)
– Theme Evolution Graph: a weighted directed graph in which each vertex is a theme span and each edge is an evolutionary transition
– Theme Life Cycle: the strength of a theme over the whole time line
• Given a text collection with time stamps, the problem of discovering ETPs is to
– Extract a theme evolution graph
– Model the life cycles of the most salient themes
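The graph definition above can be made concrete with a small data structure. The following is only a sketch of one possible representation; the class names `ThemeSpan` and `ThemeEvolutionGraph`, their fields, and the `threads` helper are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ThemeSpan:
    """A theme restricted to one time interval (a vertex of the graph)."""
    interval: int          # index of the time interval (hypothetical field)
    theme_id: int          # index of the theme within that interval
    top_words: tuple = ()  # optional (word, probability) pairs


@dataclass
class ThemeEvolutionGraph:
    """Weighted directed graph: vertices are theme spans, edges are transitions."""
    edges: dict = field(default_factory=dict)  # (src, dst) -> weight

    def add_transition(self, src, dst, weight):
        # An evolutionary transition points from an earlier to a later interval.
        if src.interval >= dst.interval:
            raise ValueError("transition must go forward in time")
        self.edges[(src, dst)] = weight

    def successors(self, span):
        return [dst for (src, dst) in self.edges if src == span]

    def threads(self, start):
        """Enumerate theme evolution threads (paths) that begin at `start`."""
        nxt = self.successors(start)
        if not nxt:
            return [[start]]
        return [[start] + rest for dst in nxt for rest in self.threads(dst)]
```

Because every edge points forward in time, the graph is acyclic and the recursive path enumeration always terminates.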
6
Research Questions
• How to represent a theme?
• How to extract themes from a collection automatically?
• How to model the transitions of themes?
• How to segment the collection with themes?
• How to model and compute the strength of each theme at a given time period?
7
Our Approach
[Figure: overview of the approach. A collection with time stamps is partitioned into time intervals t1, t2, t3, …, t. Task I (Theme Extraction): theme spans are extracted from each interval (indexed 11, 12, 13, 21, 22, 31, …, 3k in the figure), and global salient themes θ1, θ2, θ3 plus a background model B are extracted from the whole collection. Task II (Transition Modeling): transitions between theme spans are modeled, giving the theme evolution graph. Task III (Theme Segmentation): theme shifts are modeled and the collection is decoded; theme strength is then computed over time, giving the theme life cycles.]
8
Our Approach (Cont.)
• Extracting Theme Evolution Graph
– Partition collection into time intervals
– Extract themes from each time span (task I)
– Model transitions between theme spans (task II)
• Modeling theme life cycles
– Extract most salient themes from the whole collection (task I)
– Segment the collection with themes (task III)
– Compute the strength of each theme over time
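The last step, computing theme strength over time, might look like the sketch below. The slides name both an absolute and a normalized strength but do not define them, so the definitions here are assumptions: absolute strength is the number of words labeled with a theme inside a time window, and normalized strength divides that count by all labeled words in the same window. The function name `theme_strength` and its input format are hypothetical.

```python
from collections import Counter, defaultdict


def theme_strength(labels, window):
    """Compute theme strength per time window.

    labels: iterable of (time, theme) pairs, one per word after segmentation.
    window: width of a time window, in the same unit as `time`.
    Returns (absolute, normalized); each maps window index -> {theme: value}.
    """
    counts = defaultdict(Counter)
    for time, theme in labels:
        counts[int(time // window)][theme] += 1

    absolute, normalized = {}, {}
    for idx, c in counts.items():
        total = sum(c.values())
        absolute[idx] = dict(c)
        # Assumed definition: share of the window's labeled words.
        normalized[idx] = {theme: n / total for theme, n in c.items()}
    return absolute, normalized
```

Whether background-labeled words should count toward the window total is a design choice this sketch leaves to the caller, who can filter them out of `labels` beforehand.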
9
Task I: Theme Extraction
• There are k themes in the collection (or a time span); each document is a sample of words generated by multiple themes
• Infer the best theme language models that fit our data
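[Figure: "generating" word w in document d in the collection. With probability λ_B (the noise level) the word is drawn from the background model B; with probability 1 − λ_B it is drawn from one of the theme models Theme 1, Theme 2, …, Theme k, where theme j is chosen with the document-specific weight π_{d,j}. Example theme word distributions: (warning 0.3, system 0.2, ..), (Aid 0.1, donation 0.05, support 0.02, ..), (statistics 0.2, loss 0.1, dead 0.05, ..). Example background distribution: (is 0.05, the 0.04, a 0.03, ..).]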
Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood
$$p(w : d) = \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \left[ \pi_{d,j} \, p(w \mid \theta_j) \right]$$
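The mixture model above can be fitted in several ways; the slide only says "Maximum Likelihood". The sketch below assumes an EM procedure, a standard choice for mixtures of this kind, and is not taken from the slides. The function names, the default `lambda_b=0.9`, the random initialisation, and the use of collection-wide word frequencies as the background model are all assumptions.

```python
import random
from collections import Counter


def word_prob(w, d, theta, pi, background, lambda_b):
    """p(w : d) = lambda_B p(w|B) + (1 - lambda_B) sum_j pi[d][j] p(w|theta_j)."""
    mix = sum(pi[d][j] * theta[j][w] for j in range(len(theta)))
    return lambda_b * background[w] + (1 - lambda_b) * mix


def fit_themes(docs, k, lambda_b=0.9, iters=50, seed=0):
    """docs: list of token lists. Returns (theta, pi, background)."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Assumption: background model = collection-wide relative frequencies.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}
    # Random theme models, uniform mixing weights.
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(raw.values())
        theta.append({w: v / z for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_theta = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                # E-step: posterior over the component that generated w in d.
                parts = [pi[d][j] * theta[j][w] for j in range(k)]
                mix = sum(parts)
                if mix <= 0.0:
                    continue
                p_w = lambda_b * background[w] + (1 - lambda_b) * mix
                not_bg = (1 - lambda_b) * mix / p_w
                for j in range(k):
                    resp = n * not_bg * parts[j] / mix
                    new_pi[d][j] += resp
                    new_theta[j][w] += resp
        # M-step: renormalise the expected counts.
        for d in range(len(docs)):
            z = sum(new_pi[d])
            if z > 0:
                pi[d] = [v / z for v in new_pi[d]]
        for j in range(k):
            z = sum(new_theta[j].values())
            if z > 0:
                theta[j] = {w: v / z for w, v in new_theta[j].items()}
    return theta, pi, background
```

Holding λ_B fixed rather than estimating it matches the slide's note that the noise level is set manually; a larger value pushes common words into the background and leaves the themes more discriminative.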
10
Task II: Transition Modeling
• Theme spans in an earlier time interval could evolve into theme spans in a later time interval
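[Figure: evolutionary transition along the time axis T. A theme span A in an earlier interval t1 is compared with candidate theme spans B and C in a later interval t2; theme similarity decides whether a transition exists. Example word distributions shown: (microarray 0.2, gene 0.1, protein 0.05), (web 0.3, classification 0.1, topic 0.1), (Information 0.2, topic 0.1, classification 0.1, text 0.05).]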
• Similarity/distance between two theme spans is modeled with KL Divergence between two distributions
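A KL-based transition test could be sketched as follows. The slide does not say which direction of the divergence is used, how zero probabilities are handled, or how a distance becomes an edge, so three things here are assumptions: the divergence is taken from the later span to the earlier one, missing words are smoothed with a small `eps`, and an edge is kept when the distance falls below a hypothetical `threshold`.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """D(p || q) for word distributions given as dicts of word -> probability."""
    vocab = set(p) | set(q)
    # Smooth and renormalise so q is never zero where p is positive.
    ps = {w: p.get(w, 0.0) + eps for w in vocab}
    qs = {w: q.get(w, 0.0) + eps for w in vocab}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[w] / zp) * math.log((ps[w] / zp) / (qs[w] / zq)) for w in vocab
    )


def find_transitions(earlier, later, threshold):
    """Return (i, j, distance) for span pairs closer than `threshold`.

    earlier, later: lists of word distributions for two time intervals.
    """
    edges = []
    for i, p in enumerate(earlier):
        for j, q in enumerate(later):
            dist = kl_divergence(q, p)  # assumed direction: later || earlier
            if dist < threshold:
                edges.append((i, j, dist))
    return edges
```

KL divergence is asymmetric, so swapping the arguments generally changes which edges survive; the direction should be fixed once and used consistently across the whole graph.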
11
Task III: Theme Segmentation
• View the whole collection as a sequence ordered by time, and model the theme shifts in documents with a Hidden Markov Model
[Figure: decoding the collection with a Hidden Markov Model. The states are Theme 1, Theme 2, Theme 3 (θ1, θ2, θ3) and Background (B). The output probability of each state is P(w|θ), and the transition probabilities are trained. The collection is the observed word sequence w w w … .]
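Decoding the collection amounts to finding the most likely state sequence for the observed words. A minimal Viterbi sketch is given below; it assumes the start, transition, and output probabilities are already available as plain dictionaries (the slide says the transition probabilities are trained, which this sketch does not cover). The function name, argument layout, and the `floor` used in place of zero probabilities are hypothetical.

```python
import math


def viterbi(words, states, start, trans, emit, floor=1e-12):
    """Most likely state (theme) sequence for `words` under an HMM.

    start[s], trans[s][t] and emit[s][w] are probabilities;
    `floor` replaces zero or missing probabilities before taking logs.
    """
    if not words:
        return []

    def lg(x):
        return math.log(max(x, floor))

    # best[s] = log-probability of the best path ending in state s.
    best = {s: lg(start[s]) + lg(emit[s].get(words[0], 0.0)) for s in states}
    back = []
    for w in words[1:]:
        prev, best, pointers = best, {}, {}
        for t in states:
            src = max(states, key=lambda s: prev[s] + lg(trans[s][t]))
            best[t] = prev[src] + lg(trans[src][t]) + lg(emit[t].get(w, 0.0))
            pointers[t] = src
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Each word ends up labeled with one theme or with the background state, which is the per-word labeling that a theme strength computation needs.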
12
Our Approach: Revisit
[Figure: the approach overview, revisited. A collection with time stamps is partitioned into time intervals t1, t2, t3, …, t. Theme spans are extracted from each interval and transitions between them are modeled, giving the theme evolution graph. Global salient themes θ1, θ2, θ3 and a background model B are extracted from the whole collection; theme shifts are modeled and the collection is decoded; theme strength is then computed over time, giving the theme life cycles.]
13
Experiments
• Two data sets:
– Asia Tsunami: 7468 news articles spanning 50 days from 10 news sources
– KDD Abstracts: 496 abstracts from 6 years’ KDD conference proceedings
• On each data set, we extract a theme evolution graph and model the life cycles of global salient themes
14
Theme Evolution Graph: Tsunami
[Figure: theme evolution graph for the tsunami news along a time axis T (12/28/04, 01/05/05, 01/15/05, …). Theme spans shown, with top words and probabilities:
– aid 0.020, relief 0.016, U.S. 0.013, military 0.011, U.N. 0.011, …
– Bush 0.016, U.S. 0.015, $ 0.009, relief 0.008, million 0.008, …
– Indonesian 0.01, military 0.01, islands 0.008, foreign 0.008, aid 0.007, …
– system 0.0104, Bush 0.008, warning 0.007, conference 0.005, US 0.005, …
– system 0.008, China 0.007, warning 0.005, Chinese 0.005, …
– warning 0.012, system 0.012, Islands 0.009, Japan 0.005, quake 0.003, …]
15
Theme Life Cycles: Tsunami
[Figure: CNN, absolute strength. Themes plotted: Aid from the world; Research; Aid for children; Statistics; Personal experiences. Top words shown for two of the themes: ($ 0.0173, million 0.0135, relief 0.0134, aid 0.0099, U.N. 0.0066, …) and (I 0.0322, wave 0.0061, beach 0.0051, saw 0.0046, sea 0.0046, …).]
16
Theme Life Cycles: Tsunami
[Figure: XINHUA News, absolute strength. Themes plotted: Aid from the world; Research; Aid from China; Statistics; Scene and Experiences. Top words shown for two of the themes: (dollars 0.0226, million 0.0204, aid 0.0118, U.N. 0.0102, reconstruction 0.0062, …) and (China 0.0391, yuan 0.0180, Beijing 0.0089, $ 0.0058, donation 0.0052, …).]
17
Theme Life Cycles: Tsunami
[Figure: XINHUA News, normalized strength. Themes plotted: Aid from the world; Research; Aid from China; Statistics; Scene and Experiences. Top words shown for two of the themes: ($ 0.0173, million 0.0135, relief 0.0134, aid 0.0099, U.N. 0.0066, …) and (China 0.0391, yuan 0.0180, Beijing 0.0089, $ 0.0058, donation 0.0052, …).]
18
Theme Evolution Graph: KDD
[Figure: theme evolution graph for the KDD abstracts along a time axis T (1999, …, 2000, 2001, 2002, 2003, 2004). Theme spans shown, with top words and probabilities:
– SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
– decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
– Classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
– Information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
– web 0.009, classification 0.007, features 0.006, topic 0.005, …
– mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
– topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …]
19
Theme Life Cycles: KDD
[Figure: global theme life cycles of KDD abstracts. Horizontal axis: time (year), 1999 to 2004. Vertical axis: normalized strength of theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business. Top words shown for three of the themes: (gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …), (marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …), and (rules 0.0142, association 0.0064, support 0.0053, …).]
20
Summary and Future Work
• We defined a new problem of temporal text mining: discovering evolutionary theme patterns
• We proposed an algorithm to extract a theme evolution graph and model theme life cycles from a text collection
• Experiments on two data sets show that this algorithm is effective at discovering interesting ETPs
• Future Work:
– Define a formal evaluation measure and evaluate possible approaches
– Further improve the model, integrate the two parts together, and adopt prior knowledge
– Extend this model to compare theme evolutions in multiple collections (e.g. KDD proceedings and SIGMOD proceedings)
21
Thanks!