yin yang (hong kong university of science and technology) nilesh bansal (university of toronto)...
TRANSCRIPT
![Page 1: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/1.jpg)
Yin Yang (Hong Kong University of Science and Technology), Nilesh Bansal (University of Toronto), Wisam Dakka (Google), Panagiotis Ipeirotis (New York University), Nick Koudas (University of Toronto), Dimitris Papadias (Hong Kong University of Science and Technology)
![Page 2: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/2.jpg)
- Explosion of Web 2.0 content: blogs, micro-blogs, social networking
- Need for “cross reference” on the web: after we read a news article, we wonder if there are any blogs discussing it, and vice versa
![Page 3: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/3.jpg)
- A service of the BlogScope system, a real blog search engine serving 20K users/day
- Input: a text document
- Output: relevant blog posts
- Methodology
  - extract key phrases from the input document
  - use these phrases to query BlogScope
![Page 4: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/4.jpg)
![Page 5: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/5.jpg)
![Page 6: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/6.jpg)
- Novel Query-by-Document (QBD) model
- Practical phrase extractor
- Phrase set enhancement with Wikipedia knowledge (QBD-W)
- Evaluation of all proposed methods using Amazon Mechanical Turk
  - Human annotators are serious because they get paid for the tasks
![Page 7: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/7.jpg)
- Example of RF
- Distinctions between RF and QBD
  - RF involves interaction, while QBD does not
  - RF is most effective for improving recall, whereas QBD aims at both high precision and recall
  - RF starts with a keyword query; QBD directly takes a document as input
![Page 8: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/8.jpg)
- Two classes of methods
  - Very slow but accurate, from the machine learning community
  - Practical, but not as accurate as the above (our method falls in this category)
- Phrase extraction in QBD has distinct goals
  - Document retrieval accuracy is more important than the accuracy of the phrase set itself
  - A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments
![Page 9: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/9.jpg)
- Query expansion
  - Used when the user’s keyword set does not properly express what she is looking for
- PageRank, TrustRank, …
  - QBD-W follows this framework
- Wikipedia mining
![Page 10: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/10.jpg)
- Recall that Query-by-Document
  - Extracts key phrases from the input document
  - And then queries them against a search engine
- Idea: given a query document D
  - Identify all phrases from D
  - Score each individual phrase
  - Obtain the set of phrases with the highest scores, and refine it
![Page 11: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/11.jpg)
- Process the document with a Part-of-Speech tagger
  - Nouns, adjectives, verbs, …
- We compiled a list of POS patterns
  - Indexed by a POS trie forest
- Each term sequence following such a POS pattern is considered a phrase
![Page 12: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/12.jpg)
| Pattern | Instance |
|---|---|
| N | Nintendo |
| JN | global warming |
| NN | Apple computer |
| JJN | declarative approximate selection |
| NNN | computer science department |
| JCJN | efficient and effective algorithm |
| JNNN | Junior United States Senator |
| NNNN | Microsoft Host Integration Server |
| … | … |
| NNNNN | United States President Barack Obama |
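The pattern lookup can be sketched as follows. This is a minimal illustration, not the BlogScope implementation: it assumes the document has already been tagged with the single-letter tags used in the table (read here as N for noun, J for adjective, C for conjunction), it uses one trie in place of the slide's "POS trie forest", and the function names, the `X` tag, and the short pattern list are hypothetical.

```python
def build_trie(patterns):
    """Build a trie over POS patterns; the key '$' marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for tag in pattern:
            node = node.setdefault(tag, {})
        node["$"] = True
    return root


def extract_phrases(tagged_terms, trie):
    """Return every term sequence whose tag sequence matches a pattern."""
    phrases = []
    for start in range(len(tagged_terms)):
        node = trie
        for end in range(start, len(tagged_terms)):
            tag = tagged_terms[end][1]
            if tag not in node:
                break  # no pattern continues with this tag
            node = node[tag]
            if "$" in node:
                terms = [term for term, _ in tagged_terms[start:end + 1]]
                phrases.append(" ".join(terms))
    return phrases


# Hypothetical pre-tagged input; X stands for any tag outside the patterns.
PATTERNS = ["N", "JN", "NN", "NNN"]
tagged = [("the", "X"), ("global", "J"), ("warming", "N"), ("debate", "N")]
phrases = extract_phrases(tagged, build_trie(PATTERNS))
```

Walking the trie once per start position means each candidate sequence is checked against all patterns at the same time, instead of testing the patterns one by one.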
![Page 13: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/13.jpg)
![Page 14: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/14.jpg)
- Two scoring functions
  - $f_t$, based on TF/IDF
  - $f_l$, based on the concept of mutual information
![Page 15: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/15.jpg)
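$f_t(c)$ combines the summed TF/IDF of the terms of $c$, $\sum_{i=1}^{|c|} \mathrm{tfidf}(w_i)$, with $\mathrm{coherence}(c)$, where $w_1, \dots, w_{|c|}$ are the terms of $c$ and

$$\mathrm{coherence}(c) = \frac{tf(c) \times \left(1 + \log tf(c)\right)}{\frac{1}{|c|} \sum_{i=1}^{|c|} tf(w_i)}$$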
- Extracts the most characteristic phrases from the input document D
- But may obtain term sequences which are not really phrases
  - Example: “moment Dow Jones” in “at this moment Dow Jones”
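The coherence term can be illustrated with a small sketch. It reads the slide's formula as tf(c) × (1 + log tf(c)) divided by the average term frequency of the phrase's terms. The function name and all counts below are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def coherence(tf_phrase, tf_terms):
    """coherence(c) = tf(c) * (1 + log tf(c)) / mean of tf(w_i)."""
    if tf_phrase <= 0:
        return 0.0
    mean_term_tf = sum(tf_terms) / len(tf_terms)
    return tf_phrase * (1 + math.log(tf_phrase)) / mean_term_tf


# Hypothetical counts: "Dow Jones" occurs 4 times, and so do both of its terms.
tight = coherence(4, [4, 4])
# "moment Dow Jones" occurs once, while its terms occur 9, 4 and 4 times.
loose = coherence(1, [9, 4, 4])
```

With these made-up counts, the phrase whose terms mostly appear together scores higher than the sequence whose terms mostly appear elsewhere.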
![Page 16: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/16.jpg)
MI: the conditional probability of a pair of events, with respect to their individual probabilities
Eliminates non-phrases
$$\mathrm{PMI}(x, y) = \log \frac{prob(x, y)}{prob(x)\, prob(y)}$$

$$\mathrm{PMI}(c) = \log \frac{prob(c)}{\prod_{i=1}^{|c|} prob(w_i)}$$

$$prob(c) = \frac{tf(c)}{tf(POS_c)}, \qquad prob(w_i) = \frac{tf(w_i)}{tf(POS_{w_i})}$$

$f_l(c)$ combines $prob(c)$, the $idf(w_i)$ of the terms of $c$ ($i = 1, \dots, |c|$), and $\mathrm{PMI}(c)$.
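A short sketch of the phrase-level PMI may help. It reads the slide's definition as the log of prob(c) over the product of the individual term probabilities. The function name and the probabilities are hypothetical, and the natural logarithm is an assumption.

```python
import math


def pmi(prob_phrase, prob_terms):
    """PMI(c) = log( prob(c) / product of prob(w_i) )."""
    return math.log(prob_phrase / math.prod(prob_terms))


# Hypothetical probabilities. Terms that nearly always occur together:
real_phrase = pmi(0.009, [0.01, 0.01])
# Terms that co-occur about as often as chance predicts:
chance_pair = pmi(0.0001, [0.01, 0.01])
```

A value near zero means the terms appear together no more often than independent terms would, which is how this score can filter out sequences that are not really phrases.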
![Page 17: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/17.jpg)
- Take the top-k phrases with the highest scores
- Eliminate duplicates
  - Two different phrases may carry similar meanings
  - Remove phrases that
    - Are subsumed by another with a higher score
    - Differ from a better phrase only in the last term
    - And other rules …
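The two named rules can be sketched as below. This is an illustration under assumptions: "subsumed" is read as "appears as a contiguous term sequence inside a higher-scoring phrase", the unnamed "other rules" are not modeled, and the function name and scores are hypothetical.

```python
def refine(scored_phrases, k):
    """Keep up to k phrases, visiting them from the highest score down."""
    kept = []
    for phrase, _ in sorted(scored_phrases, key=lambda item: -item[1]):
        terms = phrase.split()
        redundant = False
        for better in kept:
            better_terms = better.split()
            # Rule 1: the phrase is contained in a higher-scoring phrase.
            subsumed = any(
                better_terms[i:i + len(terms)] == terms
                for i in range(len(better_terms) - len(terms) + 1)
            )
            # Rule 2: same length (at least 2), identical except for the last term.
            same_prefix = (
                len(terms) > 1
                and len(terms) == len(better_terms)
                and terms[:-1] == better_terms[:-1]
            )
            if subsumed or same_prefix:
                redundant = True
                break
        if not redundant:
            kept.append(phrase)
        if len(kept) == k:
            break
    return kept


# Hypothetical scores for illustration only.
scored = [
    ("global warming", 3.0),
    ("warming", 2.5),
    ("global climate", 2.0),
    ("Dow Jones", 1.5),
    ("global debate", 1.0),
]
top = refine(scored, 3)
```

Processing phrases in descending score order means every phrase is only ever compared against phrases that outrank it, which matches the "higher score" and "better phrase" wording of the rules.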
![Page 18: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/18.jpg)
- Motivation:
  - The user may also be interested in web documents that are related to the given one but do not contain the same key phrases
  - Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents
- Main idea:
  - Obtain an initial phrase set with QBD
  - Use Wikipedia knowledge to identify phrases that are related to the initial phrases
  - Our method follows the spreading-activation framework
![Page 19: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/19.jpg)
![Page 20: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/20.jpg)
- Given an initial phrase set
  - Locate nodes corresponding to these phrases on the Wiki Graph
  - Assign weights to these nodes
  - Iteratively spread node weights to neighbors
    - Assume the random surfer model
    - With a certain probability, return to one of the initial nodes
![Page 21: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/21.jpg)
- S is the initial phrase set
- Initial weights are normalized
- $s(c_v)$ is the score of $c_v$, assigned by QBD

$$RR^0_v = \begin{cases} \dfrac{s(c_v)}{\sum_{v' \in S} s(c_{v'})} & \text{if } v \in S \\ 0 & \text{otherwise} \end{cases}$$
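The normalization of the initial weights can be sketched in a few lines. The function name, node names, and scores are hypothetical; the sketch only assumes that each initial phrase already has a positive QBD score.

```python
def initial_weights(qbd_scores, all_nodes):
    """Nodes in the initial set S share weight in proportion to their QBD score."""
    total = sum(qbd_scores.values())
    return {node: qbd_scores.get(node, 0.0) / total for node in all_nodes}


# Hypothetical scores for an initial set S = {Wii, Nintendo}.
rr0 = initial_weights(
    {"Wii": 3.0, "Nintendo": 1.0},
    ["Wii", "Sony", "Nintendo"],
)
```

Because the weights sum to 1, they can be read as the probability of restarting the random surfer at each initial node.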
![Page 22: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/22.jpg)
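| | Wii | Sony | Nintendo | Play Station | Tomb Raider |
|---|---|---|---|---|---|
| Wii | 0 | 2/10 | 7/10 | 1/10 | 0 |
| Sony | 0 | 0 | 0 | 4/4 | 0 |
| Nintendo | 5/6 | 1/6 | 0 | 0 | 0 |
| Play Station | 2/11 | 6/11 | 1/11 | 0 | 2/11 |
| Tomb Raider | 0 | 0 | 0 | 1/1 | 0 |

$$T[v, v'] = \begin{cases} \dfrac{wt_e}{\sum_{e' = (v, w)} wt_{e'}} & \text{if } e = (v, v') \in E \\ 0 & \text{otherwise} \end{cases}$$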
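The matrix can be rebuilt from raw edge weights with a short sketch. The edge weights below are read off the numerators of the table entries (each row's denominator is the sum of that node's outgoing weights); the function name and the dictionary layout are hypothetical.

```python
from fractions import Fraction


def transition_matrix(edge_weights):
    """Normalize each node's outgoing edge weights so that they sum to 1."""
    matrix = {}
    for node, out_edges in edge_weights.items():
        total = sum(out_edges.values())
        matrix[node] = {
            target: Fraction(weight, total) for target, weight in out_edges.items()
        }
    return matrix


# Outgoing edge weights per node; a missing target means T[v, v'] = 0.
EDGES = {
    "Wii": {"Sony": 2, "Nintendo": 7, "Play Station": 1},
    "Sony": {"Play Station": 4},
    "Nintendo": {"Wii": 5, "Sony": 1},
    "Play Station": {"Wii": 2, "Sony": 6, "Nintendo": 1, "Tomb Raider": 2},
    "Tomb Raider": {"Play Station": 1},
}
T = transition_matrix(EDGES)
```

Storing only the non-zero entries keeps the structure sparse, which matters on a graph the size of Wikipedia.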
![Page 23: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/23.jpg)
- With probability $\alpha_{v'}$, proceed to a neighbor; otherwise, return to one of the initial nodes
- $\alpha_{v'}$ is a function of the node $v'$

$$RR^{i+1}_v = \sum_{e = (v', v)} \alpha_{v'}\, RR^i_{v'}\, T[v', v] + RR^0_v \sum_{v'} (1 - \alpha_{v'})\, RR^i_{v'}$$
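One update step can be sketched as follows. It assumes the update rule reads as: each node passes an α-fraction of its weight along its outgoing edges, and the remaining weight is redistributed over the initial nodes in proportion to their initial weights. The three-node graph, the α values, and the function name are hypothetical, and the sketch assumes every node with α > 0 has at least one outgoing edge.

```python
def relevance_rank_step(rr, rr0, transition, alpha):
    """One update: spread weight to neighbors, send the rest back to the start set."""
    nxt = {node: 0.0 for node in rr}
    restart_mass = 0.0
    for source, weight in rr.items():
        # With probability alpha[source], follow an outgoing edge.
        for target, prob in transition.get(source, {}).items():
            nxt[target] += alpha[source] * weight * prob
        # Otherwise, jump back to one of the initial nodes.
        restart_mass += (1 - alpha[source]) * weight
    for node in nxt:
        nxt[node] += rr0[node] * restart_mass
    return nxt


# Hypothetical three-node graph and alpha values, chosen only for illustration.
TRANSITION = {
    "A": {"B": 0.75, "C": 0.25},
    "B": {"A": 1.0},
    "C": {"A": 1.0},
}
ALPHA = {"A": 0.8, "B": 0.5, "C": 0.0}
rr0 = {"A": 1.0, "B": 0.0, "C": 0.0}

rr1 = relevance_rank_step(rr0, rr0, TRANSITION, ALPHA)
rr2 = relevance_rank_step(rr1, rr0, TRANSITION, ALPHA)
```

A node whose α is zero never forwards weight, so its neighbors need not be visited at all; in this sketch that is node C.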
![Page 24: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/24.jpg)
- $\alpha_v$ is not a constant, unlike in other algorithms (e.g., TrustRank)
- $\alpha_v$ gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones
  - Reduces the CPU overhead of RelevanceRank computation, since only a subset of nodes is considered
  - Important, as RelevanceRank is calculated online
![Page 25: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/25.jpg)
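| Iteration | Wii | Sony | Nintendo | Play Station |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0.67 | 0.13 | 0.1 | 0 |
| 2 | 0.13 | 0.06 | 0.74 | 0.06 |
| 3 | 0.49 | 0.11 | 0.38 | 0.02 |
| 4 | 0.25 | 0.08 | 0.62 | 0.05 |
| 5 | 0.41 | 0.10 | 0.46 | 0.03 |
| … | … | … | … | … |
| Infinite | 0.35 | 0.09 | 0.52 | 0.03 |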
![Page 26: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/26.jpg)
- Methodology
  - Employ human annotators at Amazon MTurk
- Dataset
  - A random sample of news articles from the New York Times, the Economist, Reuters, and Financial Times during Aug-Sep 2007
- Competitors for phrase extraction
  - QBD-TFIDF (tf-idf scoring)
  - QBD-MI (mutual information scoring)
  - QBD-YAHOO (Yahoo! phrase extractor)
![Page 27: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/27.jpg)
- Quality of Phrase Retrieval
- Quality of Document Retrieval
- Efficiency
  - The total running time of QBD is negligible
![Page 28: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/28.jpg)
![Page 29: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/29.jpg)
![Page 30: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/30.jpg)
![Page 31: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/31.jpg)
![Page 32: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/32.jpg)
| lmax | Time (seconds) |
|---|---|
| 1 | 0.160 |
| 2 | 1.142 |
| 3 | 10.262 |
| 4 | 57.915 |
| 5 | 143.828 |
![Page 33: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/33.jpg)
- We propose
  - the query-by-document model
  - two effective phrase extraction algorithms
  - enhancing the phrase set with the Wikipedia graph
- Future work
  - more sophisticated phrase extraction (e.g., with additional background knowledge)
  - blog matching using key phrases
![Page 34: Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)](https://reader036.vdocuments.mx/reader036/viewer/2022062716/56649dca5503460f94ac05c6/html5/thumbnails/34.jpg)