Context Analysis in Text Mining and Search
Qiaozhu Mei
Department of Computer Science
University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~qmei2, [email protected]
Joint work with ChengXiang Zhai
2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign
Motivating Example: Personalized Search
Query: MSR
• Mountain Safety Research
• Metropolis Street Racer
• Molten Salt Reactor
• Mars Sample Return
• Magnetic Stripe Reader
• …
Actually looking for Microsoft Research…
Motivating Example: Comparing Product Reviews

IBM Laptop Reviews / APPLE Laptop Reviews / DELL Laptop Reviews

Common Themes | "IBM" specific | "APPLE" specific | "DELL" specific
Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs
Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB
Speed | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz

Unsupervised discovery of common topics and their variations
Motivating Example: Discovering Topical Trends in Literature
Unsupervised discovery of topics and their temporal variations
[Figure: topic strength over time (1980, 1990, 1998, 2003) for SIGIR topics: TF-IDF Retrieval, IR Applications, Language Model, Text Categorization]
Motivating Example: Analyzing Spatial Topic Patterns
• How do bloggers in different states respond to topics such as the "oil price increase during Hurricane Katrina"?
• Unsupervised discovery of topics and their variations in different locations
Motivating Example: Summarizing Sentiments
Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics
Topic-sentiment dynamics (Topic = Price): strength of positive, negative, and neutral sentiment over time.

Topic-sentiment summary for the query "Dell Laptops", by facet and sentiment:

Facet 1 (Price)
• positive: "it is the best site and they show Dell coupon code as early as possible"
• negative: "Even though Dell's price is cheaper, we still don't want it."
• neutral: "mac pro vs. dell precision: a price comparis.."; "DELL is trading at $24.66"
• ……

Facet 2 (Battery)
• positive: "One thing I really like about this Dell battery is the Express Charge feature."
• negative: "my Dell battery sucks"; "Stupid Dell laptop battery"
• neutral: "i still want a free battery from dell.."
• ……
Motivating Example: Analyzing Topics on a Social Network
Publications of Gerard Salton
Publications of Bruce Croft
Unsupervised discovery of topics and correlated research communities
Data mining; Machine learning; Information retrieval
Bruce Croft
Gerard Salton
Research Questions
• What do these problems have in common?
• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?
Rest of Talk
• Background: Language Models in Text Mining and Retrieval
• Definition of context
• General methodology to model context
– Models, example applications, results
• Conclusion and Discussion
Generative Models of Text
• Text as observations: words, tags, links, etc.
• Use a unified probabilistic model to explain the appearance (generation) of observations
• Documents are generated by sampling every observation from such a generative model
• Different generation assumptions → different models
– Document Language Models
– Probabilistic Topic Models: PLSA, LDA, etc.
– Hidden Markov Models …
Multinomial Language Models
A multinomial distribution of words as a text representation:
retrieval 0.2; information 0.15; model 0.08; query 0.07; language 0.06; feedback 0.03; ……
Known as a topic model when there are k of them in text
e.g., semi-supervised learning; boosting; spectral clustering, etc.
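As a minimal sketch (not from the talk), a maximum-likelihood multinomial language model is just normalized word counts; the example text is made up:

```python
from collections import Counter

def estimate_lm(text):
    """Maximum-likelihood multinomial LM: p(w) = count(w) / total count."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

lm = estimate_lm("retrieval model query retrieval language model retrieval")
# p("retrieval") = 3/7, p("model") = 2/7
```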
Language Models in Information Retrieval (e.g., KL-Div. Method)
Document d: a text mining paper (100 words)

Doc Language Model (LM) θd: p(w|d)
text 4/100 = 0.04; mining 3/100 = 0.03; clustering 1/100 = 0.01; …; data = 0; computing = 0; …

Smoothed Doc LM θd': p(w|d')
text = 0.039; mining = 0.028; clustering = 0.01; …; data = 0.001; computing = 0.0005; …

Query q: "data mining"

Query Language Model θq: p(w|q)
data 1/2 = 0.5; mining 1/2 = 0.5

Expanded query model p(w|q')
data = 0.4; mining = 0.4; clustering = 0.1; …

Similarity function: D(θq || θd') = Σ_{w∈V} p(w|θq) log [ p(w|θq) / p(w|θd') ]
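The KL-divergence ranking can be sketched as follows; the toy numbers and the choice of Jelinek-Mercer smoothing are illustrative assumptions, not the exact setup on the slide:

```python
import math
from collections import Counter

def jm_smooth(doc_counts, collection_lm, lam=0.1):
    """Jelinek-Mercer smoothing: p(w|d') = (1-lam)*p_ml(w|d) + lam*p(w|C)."""
    total = sum(doc_counts.values())
    return {w: (1 - lam) * doc_counts.get(w, 0) / total + lam * collection_lm[w]
            for w in collection_lm}

def kl_rank_score(query_lm, smoothed_doc_lm):
    """Rank documents by -D(theta_q || theta_d'); higher is better."""
    return -sum(p * math.log(p / smoothed_doc_lm[w])
                for w, p in query_lm.items() if p > 0)

collection_lm = {"data": 0.2, "mining": 0.2, "text": 0.3, "clustering": 0.3}
d1 = Counter({"text": 4, "mining": 3, "clustering": 1})   # a text mining paper
d2 = Counter({"text": 5, "clustering": 5})                # never mentions "mining"
q = {"data": 0.5, "mining": 0.5}
# d1 outranks d2: smoothing gives "data" nonzero mass in both documents,
# but only d1 has real evidence for "mining"
```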
Probabilistic Topic Models for Text Mining
Text Collections
ProbabilisticTopic Modeling
web 0.21; search 0.10; link 0.08; graph 0.05; …

term 0.16; relevance 0.08; weight 0.07; feedback 0.04; independ. 0.03; model 0.03; …
Topic models(Multinomial distributions)
PLSA [Hofmann 99]
LDA [Blei et al. 03]
Author-Topic [Steyvers et al. 04]
CPLSA [Mei & Zhai 06]
…
Pachinko allocation [Li & McCallum 06]
CTM [Blei et al. 06]
Subtopic discovery
Opinion comparison
Summarization
Topical pattern analysis
…
Passage segmentation
Importance of Context
• Science in the year 2000 and Science in the year 1500: are we still working on the same topics?
• For a computer scientist and a gardener: does "tree, root, prune" mean the same?
• "Football" means soccer in Europe. What about in the US?

Context affects topics!
Context Features of Text (Meta-data)
Weblog article context features: author; author's occupation; location; time; communities; source
Context = Partitioning of Text
Examples of partitions:
• By year: 1998, 1999, …, 2005, 2006 (e.g., papers written in 1998)
• By venue: WWW, SIGIR, ACL, KDD, SIGMOD
• Papers written by authors in the US
• Papers about the Web
Rich Context Information in Text
• News articles: time, publisher, etc.
• Blogs: time, location, author, …
• Scientific Literature: author, publication year, conference, citations, …
• Query Logs: time, IP address, user, clicks, …
• Customer reviews: product, source, time, sentiments, …
• Emails: sender, receiver, time, thread, …
• Web pages: domain, time, click rate, etc.
• More? entity-relations, social networks, ……
Categories of Context
• Some partitions of text are explicit → explicit context
– Time; location; author; conference; user; IP; etc.
– Similar to metadata
• Some partitions are implicit → implicit context
– Sentiments; missions; goals; intents
• Some partitions are at the document level
• Some are at a finer granularity
– Context of a word; an entity; a pattern; a query; etc.
– Sentences; sliding windows; adjacent words; etc.
Context Analysis
• Use context to infer semantics
– Annotating frequent patterns; labeling of topic models
• Use context to provide targeted service
– Personalized search; intent-based search; etc.
• Compare contextual patterns of topics
– Evolutionary topic patterns; spatiotemporal topic patterns; topic-sentiment patterns; etc.
• Use context to help other tasks
– Social network analysis; impact summarization; etc.
General Methodology to Model Context
• Context → Generative Model
– Observations in the same context are generated with a unified model
– Observations in different contexts are generated with different models
– Observations in similar contexts are generated with similar models
• Text is generated with a mixture of such generative models
– Example Task; Model; Sample results
Model a unique context with a unified model (Generation)
Probabilistic Latent Semantic Analysis (Hofmann ’99)
A Document d (about "Hurricane Katrina")

Topics θ1…k:
θ1 "government": government 0.3; response 0.2; …
θ2 "donation": donate 0.1; relief 0.05; help 0.02; …
θ3 "New Orleans": city 0.2; new 0.1; orleans 0.05; …

πd: P(θi|d), the topic mixing weights of d

Generation: choose a topic θi according to πd, then draw a word from θi (e.g., government, response; donate, aid, help; new, Orleans).

Example document: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"
Plate notation: for each of D documents, each of its N words w_{d,n} is generated by choosing a topic z_{d,n} according to πd and drawing the word from θk (K topics).

p(d, w_{d,n}) = p(d) Σ_{k=1}^{K} p(w_{d,n} | z_{d,n} = k) p(z_{d,n} = k | d)

(Example: documents about "Hurricane Katrina")
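The generative story above can be fit with EM; this is a compact toy-scale sketch (variable names are mine, not from the slide):

```python
import numpy as np

def plsa(X, K, iters=50, seed=0):
    """EM for PLSA. X[d, w] holds word counts; returns p(w|z) (K x W)
    and p(z|d) (D x K)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: p(z|d,w) proportional to p(z|d) * p(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # D x K x W
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts
        nzw = X[:, None, :] * post                            # D x K x W
        p_w_z = nzw.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = nzw.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# two clearly separated "topics": words {0,1} vs. words {2,3}
X = np.array([[5., 5., 0., 0.], [4., 6., 0., 0.],
              [0., 0., 5., 5.], [0., 1., 6., 4.]])
p_w_z, p_z_d = plsa(X, K=2)
# documents 0-1 tend to load on one topic, documents 2-3 on the other
```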
Example: Topics in Science (D. Blei 05)
Label a Multinomial Topic Model
• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
term 0.1599; relevance 0.0752; weight 0.0660; feedback 0.0372; independence 0.0311; model 0.0310; frequent 0.0233; probabilistic 0.0188; document 0.0173; …
iPod Nano
Pseudo-feedback
Information Retrieval
Retrieval models
じょうほうけんさく ("information retrieval" in Japanese)
– Mei and Zhai 06: a topic in SIGIR
Automatic Labeling of Topics
Collection (e.g., SIGIR) → topic models to label, e.g.:
term 0.16; relevance 0.07; weight 0.07; feedback 0.04; independence 0.03; model 0.03; …
filtering 0.21; collaborative 0.15; …
trec 0.18; evaluation 0.10; …

1. Candidate label pool (NLP chunker, ngram statistics): information retrieval, retrieval model, index structure, relevance feedback, …
2. Relevance score: information retrieval 0.26; retrieval models 0.19; IR models 0.17; pseudo feedback 0.06; ……
3. Discrimination: information retrieval 0.26 → 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; ……
4. Coverage: retrieval models 0.20; IR models 0.18 → 0.02; pseudo feedback 0.09; ……; information retrieval 0.01
Candidate label l1: "clustering algorithm" (a good label)
p(w | "clustering algorithm"): clustering; hash; dimension; algorithm; partition; …

Candidate label l2: "hash join"
p(w | "hash join"): clustering; hash; dimension; key; algorithm; …
Context snippets of "hash join": "key … hash join … code … hashtable … search … hash join …"; "map key … hash … algorithm … key …"; "… hash … key … table … join …"
Label Relevance: Context Comparison
• Intuition: prefer the label whose context distribution is similar to the topic distribution

Topic θ: P(w|θ): clustering; dimension; partition; algorithm; hash; …

Rank candidate labels by estimating p(w | θ_l) ∝ PMI(w, l | C) and scoring

Score(l, θ) = -D(θ || θ_l)
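The scoring step can be sketched directly; the toy distributions below are illustrative, and in practice the label context models would be estimated from PMI statistics over the collection:

```python
import math

def label_score(topic_lm, label_context_lm, eps=1e-10):
    """Score(l, theta) = -D(theta || theta_l); higher when the label's
    context distribution resembles the topic's word distribution."""
    return -sum(p * math.log(p / label_context_lm.get(w, eps))
                for w, p in topic_lm.items() if p > 0)

topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.1, "hash": 0.1}
l1 = {"clustering": 0.35, "dimension": 0.2, "partition": 0.15,
      "algorithm": 0.2, "hash": 0.1}       # context of "clustering algorithm"
l2 = {"hash": 0.4, "join": 0.3, "key": 0.2, "algorithm": 0.1}  # "hash join"
# label_score(topic, l1) > label_score(topic, l2), so "clustering algorithm"
# is preferred over "hash join" for this topic
```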
Results: Sample Topic Labels

Topic: tree 0.09; trees 0.08; spatial 0.08; b 0.05; r 0.04; disk 0.02; array 0.01; cache 0.01
Labels: "r tree"; "b tree"; …; "indexing methods"

Topic: north 0.02; case 0.01; trial 0.01; iran 0.01; documents 0.01; walsh 0.009; reagan 0.009; charges 0.007
Labels: "iran contra"; …

Topic: the, of, a, and, to, data > 0.02; …; clustering 0.02; time 0.01; clusters 0.01; databases 0.01; large 0.01; performance 0.01; quality 0.005
Labels: "clustering algorithm"; "clustering structure"; …; "large data, data quality, high data, data application", …
Model different contexts with different models (Discrimination, Comparison)
Example: Finding Evolutionary Patterns of Topics
Topics in KDD and their content variations over contexts (years 1999-2004), e.g.:

decision 0.006; tree 0.006; classifier 0.005; class 0.005; Bayes 0.005; …
SVM 0.007; criteria 0.007; classification 0.006; linear 0.005; …
classification 0.015; text 0.013; unlabeled 0.012; document 0.008; labeled 0.008; learning 0.007; …
web 0.009; classification 0.007; features 0.006; topic 0.005; …
mixture 0.005; random 0.006; cluster 0.006; clustering 0.005; variables 0.005; …
topic 0.010; mixture 0.008; LDA 0.006; semantic 0.005; …
information 0.012; web 0.010; social 0.008; retrieval 0.007; distance 0.005; networks 0.004; …
Example: Finding Evolutionary Patterns of Topics (II)
[Figure from (Mei '05): normalized strength of themes in KDD over time (1999-2004); themes: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]
Strength Variations over Contexts
View of Topics: Context-Specific Version of Views
One context → one view; a document selects from a mix of views.

Topic 1: Retrieval Model (retrieve; model; relevance; document; query)
Topic 2: Feedback (feedback; judge; expansion; pseudo; query)

Context 1: 1998 ~ 2006 (e.g., after "Language Modeling"): language; model; smoothing; query; generation; feedback; mixture; estimate; EM; pseudo
Context 2: 1977 ~ 1998 (i.e., before "Language Modeling"): vector; Rocchio; weighting; feedback; term; vector space; TF-IDF; Okapi; LSI; retrieval
Coverage of Topics: Distribution over Topics
Topics: Background; Oil Price; Government Response; Aid and donation

• A coverage of topics: a (strength) distribution over the topics
• One context → one coverage
• A document selects from a mix of multiple coverages

Example document: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"

Context: Texas → one coverage over these topics; Context: Louisiana → a different coverage
A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
• Process of contextual text mining
– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compare a topic from different views
– Compute strength dynamics of topics from coverages
– Compute other probabilistic topic patterns
The “Generation” Process
Views: View 1 (Texas); View 2 (July 2005); View 3 (sociologist)

Context of document: Time = July 2005; Location = Texas; Author = Eric Brill; Occup. = Sociologist; Age = 45+; …

Topics θ1…k:
θ1 "government": government 0.3; response 0.2; …
θ2 "donation": donate 0.1; relief 0.05; help 0.02; …
θ3 "New Orleans": city 0.2; new 0.1; orleans 0.05; …

Topic coverages: Texas; July 2005; document-specific; sociologist; ……

Generation of each word: choose a view; choose a coverage; choose a theme; draw the word from θi (e.g., government, response; donate, aid, help; new, Orleans).

Example document: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"
An Intuitive Example
• Two topics: web search; machine learning
• I am writing a WWW paper, so I will cover more of "web search" than "machine learning" (a coverage).
– But of course I have my own taste.
• I am from a search engine company, so when I write about "web search", I will focus on "search engine" and "online advertisements" (a view)…
The Probabilistic Model
• A probabilistic model explaining the generation of a document D and its context features C: to write such a document, the author will
– Choose a view v_i according to the view distribution p(v_i | D, C)
– Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
– Choose a theme θ_l according to the coverage κ_j
– Generate a word using θ_l (under view v_i)
• The log-likelihood of the document collection is:

log p(C) = Σ_{(D,C)} Σ_{w∈V} c(w, D) log [ Σ_{i=1}^{n} p(v_i | D, C) Σ_{j=1}^{m} p(κ_j | D, C) Σ_{l=1}^{k} p(θ_l | κ_j) p(w | θ_l, v_i) ]
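To make the nested sums concrete, here is a direct (unoptimized) evaluation of the inner mixture; all parameter values below are made-up toys, not estimates from the paper:

```python
def cplsa_word_prob(w, p_view, p_cov, p_theme_given_cov, p_w_theme_view):
    """p(w|D,C) = sum_i p(v_i|D,C) sum_j p(kappa_j|D,C)
                   sum_l p(theta_l|kappa_j) p(w|theta_l, v_i)."""
    total = 0.0
    for i, pv in enumerate(p_view):
        for j, pk in enumerate(p_cov):
            for l, pt in enumerate(p_theme_given_cov[j]):
                total += pv * pk * pt * p_w_theme_view[i][l].get(w, 0.0)
    return total

# one view, two coverages, two themes (toy parameters)
p_view = [1.0]
p_cov = [0.5, 0.5]
p_theme_given_cov = [[0.8, 0.2], [0.2, 0.8]]
p_w_theme_view = [[{"government": 0.3, "response": 0.2},
                   {"donate": 0.1, "relief": 0.05}]]
# p("government") = 1.0 * (0.5*0.8 + 0.5*0.2) * 0.3 = 0.15
```

The collection log-likelihood is then the sum of c(w, D) times the log of this quantity over all documents and words.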
Example results: Query Log Analysis, Context = Days of Week
Day-Week Pattern of Search Difficulty

[Figure: total clicks (up to ~10M/day) and search difficulty H(Url | IP, Q) (~0.9-1.25) over days 1-23 of Jan 2006 (Jan 1st is a Sunday)]

Query & clicks: more queries/clicks on weekdays.
Search difficulty: clicks are more difficult to predict on weekends.
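The difficulty measure H(Url | IP, Q) can be estimated from click records; this plug-in estimator on a toy log is my own sketch, not code from the work:

```python
import math
from collections import Counter

def conditional_entropy(records):
    """H(Url | IP, Q) = -sum_{ip,q,url} p(ip,q,url) * log2 p(url | ip,q),
    estimated from (ip, query, clicked_url) records."""
    joint = Counter(records)
    ctx = Counter((ip, q) for ip, q, _ in records)
    n = len(records)
    return -sum(c / n * math.log2(c / ctx[(ip, q)])
                for (ip, q, url), c in joint.items())

clicks = [("1.2.3.4", "msg", "msg.com"), ("1.2.3.4", "msg", "msg.com"),
          ("5.6.7.8", "msg", "thegarden.com"), ("5.6.7.8", "msg", "msn.com")]
# first user is fully predictable (0 bits); second is a coin flip (1 bit);
# overall H = 0.5 bits
```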
Query Log Analysis, Context = Type of Query

[Figure: query frequency of "yahoo", "mapquest", "cnn" over days 1-24 of Jan 2006 (Jan 1st is a Sunday)]
Business queries: clear day-week pattern; more frequent on weekdays than weekends.

[Figure: query frequency of "sex", "movie", "mp3" over days 1-24 of Jan 2006 (Jan 1st is a Sunday)]
Consumer queries: no clear day-week pattern; weekends are comparable, even more frequent than weekdays.
Bursting Topics in SIGMOD, Context = Time (Years)

[Figure: bursting topic strengths (0-1800) over years: Sensor data, XML data, Web data, Data Streams, Ranking/Top-K]
Spatiotemporal Text Mining, Context = Time & Location
Theme: government response in Hurricane Katrina
Week 1: The theme is the strongest along the Gulf of Mexico.
Week 2: The discussion moves towards the north and west.
Week 3: The theme distributes more uniformly over the states.
Week 4: The theme is again strong along the east coast and the Gulf of Mexico.
Week 5: The theme fades out in most states.
Faceted Opinions, Context = Sentiments
Topic 1: Movie
• Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon."; "Directed by: Ron Howard Writing credits: Akiva Goldsman ..."; "After watching the movie I went online and some research on ..."
• Positive: "Tom Hanks stars in the movie, who can be mad at that?"; "Tom Hanks, who is my favorite movie star act the leading role."; "Anybody is interested in it?"
• Negative: "But the movie might get delayed, and even killed off if he loses."; "protesting ... will lose your faith by ... watching the movie."; "... so sick of people making such a big deal about a FICTION book and movie."

Topic 2: Book
• Neutral: "I remembered when i first read the book, I finished the book in two days."; "I'm reading "Da Vinci Code" now. …"
• Positive: "Awesome book."; "So still a good book to past time."
• Negative: "... so sick of people making such a big deal about a FICTION book and movie."; "This controversy book cause lots conflict in west society."
Sentiment Dynamics, Context = Time & Sentiments
Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg )
Facet: the impact on religious beliefs. ( Bursts during the movie, Neg > Pos )
“ the da vinci code”
Event Impact Analysis: IR Research
Theme: retrieval models (SIGIR papers)
term 0.1599; relevance 0.0752; weight 0.0660; feedback 0.0372; independence 0.0311; model 0.0310; frequent 0.0233; probabilistic 0.0188; document 0.0173; …

Before the start of the TREC conferences (1992): vector 0.0514; concept 0.0298; extend 0.0297; model 0.0291; space 0.0236; boolean 0.0151; function 0.0123; feedback 0.0077; …
After 1992: xml 0.0678; email 0.0197; model 0.0191; collect 0.0187; judgment 0.0102; rank 0.0097; subtopic 0.0079; …

Before the publication of the paper "A language modeling approach to information retrieval" (1998): probabilist 0.0778; model 0.0432; logic 0.0404; ir 0.0338; boolean 0.0281; algebra 0.0200; estimate 0.0119; weight 0.0111; …
After 1998: model 0.1687; language 0.0753; estimate 0.0520; parameter 0.0281; distribution 0.0268; probable 0.0205; smooth 0.0198; markov 0.0137; likelihood 0.0059; …
Model similar contexts with similar models (Smoothing, Regularization)
Personalization with Backoff
• Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate
• Disambiguate based on user's prior clicks
• We don't have enough data for everyone!
– Back off to classes of users
• Proof of concept:
– Classes defined by IP addresses
• Better:
– Market segmentation (demographics)
– Collaborative filtering (other users who click like me)

Context = IP
P(Url | IP, Q) = λ4 P(Url | IP4, Q) + λ3 P(Url | IP3, Q) + λ2 P(Url | IP2, Q) + λ1 P(Url | IP1, Q) + λ0 P(Url | IP0, Q)

where the IP contexts back off from the full address to the whole population:
IP4 = 156.111.188.243
IP3 = 156.111.188.*
IP2 = 156.111.*.*
IP1 = 156.*.*.*
IP0 = *.*.*.*

Full personalization: every context has a different model → sparse data!
No personalization: all contexts share the same model.
Personalization with backoff: similar contexts have similar models.
Backing Off by IP
• λs estimated with EM and cross-validation
• A little bit of personalization
– Better than too much
– Or too little

[Figure: estimated weights (≈0-0.3) for λ4, λ3, λ2, λ1, λ0]
λ4: weight for the first 4 bytes of the IP
λ3: weight for the first 3 bytes of the IP
λ2: weight for the first 2 bytes of the IP
……

P(Url | IP, Q) = Σ_{i=0}^{4} λ_i P(Url | IP_i, Q)

Sparse data ↔ missed opportunity
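The interpolated backoff estimator can be sketched as follows; the λ values here are arbitrary (on the slide they are learned with EM and cross-validation), and the click log is a toy:

```python
from collections import Counter

def backoff_prob(url, ip, query, clicks, lambdas):
    """P(Url|IP,Q) = sum_{i=0..4} lambda_i * P(Url | IP_i, Q), where IP_i is
    the class of users sharing the first i bytes of the IP address
    (IP_0 = everyone). lambdas are given in the order lambda_4 .. lambda_0."""
    parts = ip.split(".")
    score = 0.0
    for lam, k in zip(lambdas, (4, 3, 2, 1, 0)):
        ctx = [u for (cip, q, u) in clicks
               if q == query and cip.split(".")[:k] == parts[:k]]
        if ctx:
            score += lam * Counter(ctx)[url] / len(ctx)
    return score

clicks = [("156.111.188.243", "msg", "madisonsquaregarden.com"),
          ("200.1.2.3", "msg", "glutamate.org")]
lambdas = [0.3, 0.25, 0.2, 0.15, 0.1]
# for the first user, every class predicts madisonsquaregarden.com except the
# global class, where its probability is 1/2:
# 0.3 + 0.25 + 0.2 + 0.15 + 0.1*0.5 = 0.95
```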
Social Network as Correlated Contexts
Example paper titles linked in the network: "Optimization of Relevance Feedback Weights"; "Parallel Architecture in IR ..."; "Predicting query performance"; "... A Language Modeling Approach to Information Retrieval ..."
Linked contexts are similar to each other
Social Network Context for Topic Modeling
• Context = author
• Coauthor = similar contexts
• Intuition: I work on similar topics to my neighbors
Smoothed topic distributions over contexts (e.g., a coauthor network)
O(C, G) = (1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1}^{k} p(θ_j | d) p(w | θ_j) - λ · (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1}^{k} ( p(θ_j | u) - p(θ_j | v) )²
Topic Modeling with Network Regularization (NetPLSA)
• Basic assumption (e.g., co-author graph): related authors work on similar topics

Maximize:
O(C, G) = (1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1}^{k} p(θ_j | d) p(w | θ_j)   [PLSA log-likelihood]
          - λ · (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1}^{k} ( p(θ_j | u) - p(θ_j | v) )²   [graph harmonic regularizer, a generalization of [Zhu '03]]

– λ: tradeoff between topic likelihood and smoothness
– w(u, v): importance (weight) of an edge
– ( p(θ_j | u) - p(θ_j | v) )²: difference of the topic distributions on neighboring vertices
– p(θ_j | d): topic distribution of a document

The regularizer can be rewritten as (1/2) Σ_{j=1…k} f_j^T Δ f_j, where f_{j,u} = p(θ_j | u) and Δ is the graph Laplacian.
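The regularized objective can be evaluated directly; this sketch assumes fixed parameters (it does not implement the full fitting procedure of the WWW'08 paper), and the tiny inputs are made up:

```python
import numpy as np

def netplsa_objective(X, p_z_d, p_w_z, edges, weights, lam=0.5):
    """O(C,G) = (1-lam) * PLSA log-likelihood
              - lam * 0.5 * sum_{(u,v) in E} w(u,v) * sum_j (p(z_j|u)-p(z_j|v))^2.
    X: D x W counts; p_z_d: D x K; p_w_z: K x W."""
    p_w_d = p_z_d @ p_w_z                  # p(w|d) = sum_j p(z_j|d) p(w|z_j)
    loglik = np.sum(X * np.log(p_w_d + 1e-12))
    reg = sum(w * np.sum((p_z_d[u] - p_z_d[v]) ** 2)
              for (u, v), w in zip(edges, weights))
    return (1 - lam) * loglik - lam * 0.5 * reg

X = np.array([[2.0, 0.0], [0.0, 2.0]])
p_w_z = np.array([[0.9, 0.1], [0.1, 0.9]])
p_z_d_sharp  = np.array([[1.0, 0.0], [0.0, 1.0]])   # linked docs disagree
p_z_d_smooth = np.array([[0.5, 0.5], [0.5, 0.5]])   # linked docs agree
edges, weights = [(0, 1)], [1.0]
# the regularizer penalizes only the "sharp" assignment on the linked pair
```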
Topical Communities with PLSA
Topic 1 Topic 2 Topic 3 Topic 4
term 0.02 peer 0.02 visual 0.02 interface 0.02
question 0.02 patterns 0.01 analog 0.02 towards 0.02
protein 0.01 mining 0.01 neurons 0.02 browsing 0.02
training 0.01 clusters 0.01 vlsi 0.01 xml 0.01
weighting 0.01 stream 0.01 motion 0.01 generation 0.01
multiple 0.01 frequent 0.01 chip 0.01 design 0.01
recognition 0.01 e 0.01 natural 0.01 engine 0.01
relations 0.01 page 0.01 cortex 0.01 service 0.01
library 0.01 gene 0.01 spike 0.01 social 0.01
Noisy community assignment
Topical Communities with NetPLSA
Topic 1 Topic 2 Topic 3 Topic 4
retrieval 0.13 mining 0.11 neural 0.06 web 0.05
information 0.05 data 0.06 learning 0.02 services 0.03
document 0.03 discovery 0.03 networks 0.02 semantic 0.03
query 0.03 databases 0.02 recognition 0.02 services 0.03
text 0.03 rules 0.02 analog 0.01 peer 0.02
search 0.03 association 0.02 vlsi 0.01 ontologies 0.02
evaluation 0.02 patterns 0.02 neurons 0.01 rdf 0.02
user 0.02 frequent 0.01 gaussian 0.01 management 0.01
relevance 0.02 streams 0.01 network 0.01 ontology 0.01
Information Retrieval
Data mining Machine learning
Web
Coherent community assignment
Smoothed Topic Map
Map a topic onto the network (e.g., using p(θ|a)): PLSA (topic: "information retrieval") vs. NetPLSA
Core contributors
Irrelevant
Intermediate
Smoothed Topic Map
The Windy States
– Blog articles: "weather"
– US states network
– Topic: "windy"

[Maps: PLSA, NetPLSA, real reference]
2007 © ChengXiang Zhai LLNL, Aug 15, 2007 56
Related Work
• Specific Contextual Text Mining Problems– Multi-collection Comparative Mining (e.g., [Zhai et al. 04])
– Temporal theme pattern (e.g., [Mei et al. 05], [Blei et al. 06], [Wang et al. 06])
– Spatiotemporal theme analysis (e.g., [Mei et al. 06], [Wang et al. 07])
– Author-topic analysis (e.g., [Steyvers et al. 04], [Zhou et al. 06])
– …
• Probabilistic topic models:– Probabilistic latent semantic analysis (PLSA) (e.g. [Hofmann 99])
– Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])
– Many extensions (e.g., [Blei et al. 05], [Li and McCallum 06])
Conclusions
• Context analysis in text mining and search
• General methodology to model context in text
– A unified generative model for observations in the same context
– Different models for different contexts
– Similar models for similar contexts
– Generation → discrimination → smoothing
• Many applications
Discussion: Context in Search
• Not all contexts are useful
– E.g., personalized search vs. search by time of day
– How can we know which contexts are more useful?
• Many contexts are useful
– E.g., personalized search; task-based search; localized search
– How can we combine them?
• Can we do better than market segmentation?
– Back off to users who search like me → collaborative search
– But who searches like you?
References
• CPLSA
– Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD'06.
• NetPLSA
– Q. Mei, D. Cai, D. Zhang, C. Zhai. Topic Modeling with Network Regularization. In Proceedings of WWW'08.
• Labeling
– Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD'07.
• Personalization
– Q. Mei, K. Church. Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? In Proceedings of WSDM'08.
• Applications
– Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD'05.
– Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW'06.
– Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW'07.
Experiments
• Bibliography data and coauthor networks
– DBLP: text = titles; network = coauthors
– Four conferences (expecting 4 topics): SIGIR, KDD, NIPS, WWW
• Blog articles and geographic network
– Blogs from spaces.live.com containing topical words, e.g., "weather"
– Network: US states (adjacent states)
Coherent Topical Communities
Semantics of community: “Data Mining (KDD) ”
NetPLSA
mining 0.11
data 0.06
discovery 0.03
databases 0.02
rules 0.02
association 0.02
patterns 0.02
frequent 0.01
streams 0.01
PLSA
peer 0.02
patterns 0.01
mining 0.01
clusters 0.01
stream 0.01
frequent 0.01
e 0.01
page 0.01
gene 0.01
PLSA
visual 0.02
analog 0.02
neurons 0.02
vlsi 0.01
motion 0.01
chip 0.01
natural 0.01
cortex 0.01
spike 0.01
NetPLSA
neural 0.06
learning 0.02
networks 0.02
recognition 0.02
analog 0.01
vlsi 0.01
neurons 0.01
gaussian 0.01
network 0.01
Semantics of community: “machine learning (NIPS)”