Learning to Cluster Web Search Results SIGIR 04

Upload: beverley-copeland

Post on 15-Jan-2016


Page 1: Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search

Learning to Cluster Web Search Results

SIGIR 04


ABSTRACT

Organizing Web search results into clusters helps users browse through search results quickly.

Traditional clustering techniques do not generate clusters with highly readable names, and classification methods need pre-defined categories.

Using a regression model learned from human-labeled training data, the unsupervised clustering problem is converted into a supervised learning problem.


INTRODUCTION

A user submits the query “jaguar” to Google. To find the results related to “big cat”, the user has to go to the 10th, 11th, 32nd and 71st results.

A possible solution to this problem is to cluster the search results online into different groups.

Ranking salient phrases as cluster names.

Re-formalize the clustering problem as a salient phrase ranking problem.


INTRODUCTION

Salient phrases

Titles and snippets

*Real demonstration of this technique http://vivisimo.com/


INTRODUCTION

Leouski A. V. and Croft W. B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-76, Department of Computer Science, 1996.

Zamir O., Etzioni O. Web Document Clustering. (SIGIR'98), 1998.

Zamir O., Etzioni O. Grouper: A Dynamic Clustering Interface to Web Search Results. (WWW8), 1999.

Leuski A. and Allan J. Improving Interactive Retrieval by Combining Ranked List and Clustering. Proceedings of RIAO, 2000.

Liu B., Chin C. W., and Ng H. T. Mining Topic-Specific Concepts and Definitions on the Web. (WWW'03), 2003.

[Diagram labels: Training data, Combined properties, Regression method, Single salience score, Ranking phrases]

Top-ranked phrases = salient phrases = candidate cluster names


Problem Formalization And Algorithm

Problem Formalization:

Ranked list of search results:

q: the current query; d_i: a document; r: some (unknown) function that calculates the probability.

To find a set of topic-coherent clusters for query q (traditional):

To find a ranked list of clusters C', with each cluster associated with a cluster name as well as a new ranked list of documents:

Algorithm (four steps):

1. Search result fetching
2. Document parsing and phrase property calculation
3. Salient phrase ranking
4. Post-processing


Salient Phrases Extraction 1/3

Five properties:

1. Phrase Frequency / Inverted Document Frequency (TFIDF)

$$\mathrm{TFIDF} = f(w) \cdot \log \frac{N}{|D(w)|}$$

w: the current phrase; D(w): the set of documents that contain w.

2. Intra-Cluster Similarity (ICS)

Documents are represented in the vector space model: d_i = (x_{i1}, x_{i2}, …). Each component of the vectors is weighted by TFIDF.

For each candidate cluster, the centroid is calculated as:

$$o = \frac{1}{|D(w)|} \sum_{d_i \in D(w)} d_i$$

ICS is calculated as:

$$\mathrm{ICS} = \frac{1}{|D(w)|} \sum_{d_i \in D(w)} \cos(d_i, o)$$
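As an illustration only, the TFIDF and ICS properties can be sketched in Python. The function names and the toy vectors are assumptions made for this sketch; f(w), N and |D(w)| are simply passed in as numbers.

```python
import math

def tfidf(freq, n_docs, n_docs_with_phrase):
    """TFIDF = f(w) * log(N / |D(w)|)."""
    return freq * math.log(n_docs / n_docs_with_phrase)

def centroid(vectors):
    """Component-wise mean o of the document vectors in D(w)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ics(vectors):
    """ICS = average cosine similarity between each document and the centroid."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)

# Toy TFIDF-weighted document vectors (invented for illustration).
tight = [[1.0, 0.0], [1.0, 0.0]]   # documents that point the same way
loose = [[1.0, 0.0], [0.0, 1.0]]   # documents with nothing in common

print(ics(tight), ics(loose))
```

A compact candidate cluster (`tight`) scores higher than a scattered one (`loose`), which is the behaviour the ICS property is meant to capture.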


Salient Phrases Extraction 2/3

3. Phrase Length (Len)

Example: Len(“big”) = 1, Len(“big cats”) = 2.

4. Cluster Entropy (CE)

For a given phrase w, the corresponding document set D(w) might overlap with other D(w_i), where w_i ≠ w.

One extreme: D(w) is spread evenly over the other D(w_i), so w is too general a phrase to be a good salient phrase.

Other extreme: D(w) seldom overlaps with any D(w_i), so w may have a distinct meaning.

Example: for the query “jaguar”, “big cat” seldom co-occurs with other salient keywords such as “car” or “mac os”.
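A minimal sketch of how such an overlap-based entropy could be computed. It assumes the usual entropy form CE = -Σ p·log p over the other phrases t, with p = |D(w) ∩ D(t)| / |D(w)| and 0·log 0 taken as 0; that exact form, the function name and the toy document ids are assumptions of this sketch.

```python
import math

def cluster_entropy(w, doc_sets):
    """Entropy of the overlap between D(w) and every other D(t).

    doc_sets maps each phrase to the set of ids of documents containing it.
    Assumed form: CE = -sum_t p * log(p), p = |D(w) & D(t)| / |D(w)|.
    """
    d_w = doc_sets[w]
    ce = 0.0
    for t, d_t in doc_sets.items():
        if t == w:
            continue
        p = len(d_w & d_t) / len(d_w)
        if p > 0:               # 0 * log(0) is treated as 0
            ce -= p * math.log(p)
    return ce

# Toy data for the query "jaguar" (document ids are invented).
doc_sets = {
    "big cat": {10, 11, 32, 71},
    "car": {1, 2, 3},
    "mac os": {4, 5, 6},
    "pictures": {1, 4, 10, 11},   # a general phrase that cuts across topics
}

print(cluster_entropy("big cat", doc_sets), cluster_entropy("pictures", doc_sets))
```

In this toy data the general phrase “pictures” overlaps with every other phrase and gets a higher entropy than “big cat”, which overlaps with only one.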


Salient Phrases Extraction 3/3

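5. Phrase Independence** (IND)

A phrase is independent when the entropy of its context is high. For the left context, with l(w) the set of left-context terms of w:

$$\mathrm{IND}_l = -\sum_{t \in l(w)} \frac{f(t)}{TF} \log \frac{f(t)}{TF}$$

IND_r is computed in the same way from the right context, and the two values are averaged:

$$\mathrm{IND} = \frac{\mathrm{IND}_l + \mathrm{IND}_r}{2}$$

** Chien L. F. PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. (SIGIR'97), 1997.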
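The context-entropy idea behind IND can be sketched as follows. This is only an illustration: the function names are hypothetical, and TF is assumed here to be the total number of occurrences of the phrase, so that f(t)/TF is the share of occurrences with neighbour t.

```python
import math
from collections import Counter

def context_entropy(context_terms):
    """Entropy of the terms seen on one side of a phrase.

    context_terms holds one neighbouring term per occurrence of the phrase.
    TF is assumed to be the number of occurrences (an assumption of this sketch).
    """
    tf = len(context_terms)
    counts = Counter(context_terms)
    return -sum((c / tf) * math.log(c / tf) for c in counts.values())

def independence(left_terms, right_terms):
    """IND = (IND_l + IND_r) / 2."""
    return (context_entropy(left_terms) + context_entropy(right_terms)) / 2

# A phrase that always has the same neighbours is not independent ...
fixed = independence(["big"] * 4, ["pictures"] * 4)
# ... while a phrase with varied neighbours on both sides is.
varied = independence(["a", "b", "c", "d"], ["e", "f", "g", "h"])
print(fixed, varied)
```

A word such as “cat” that is always preceded by “big” has zero left-context entropy, which suggests the longer phrase “big cat” is the better candidate.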


Learning to Rank Salient Phrases 1/3

Regression is a classic statistical problem that tries to determine the relationship between two random variables x = (x1, x2, …, xp) and y. Here x = (TFIDF, LEN, ICS, CE, IND), and y can be any real-valued score.

Linear Regression:

$$y = b_0 + \sum_{j=1}^{p} b_j x_j + e$$

The residual e is a random variable. The coefficients b_j (0 ≤ j ≤ p) are determined by the condition that the sum of the squared residuals is as small as possible.
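To make the least-squares condition concrete, here is a small pure-Python sketch that solves the normal equations by Gauss-Jordan elimination. It is not the authors' implementation; the function names and the two-feature toy data are invented for illustration.

```python
def fit_linear(X, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations.

    Assumes the features are not collinear (otherwise a pivot is zero).
    """
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    k = len(rows[0])
    # Augmented system (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(b, x):
    """Salience score b0 + sum_j b_j * x_j (the residual e is not modelled)."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

# Toy data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover it.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1, 3, 4, 6]
b = fit_linear(X, y)
print(b, predict(b, [2, 2]))
```

With the five properties as features, `predict` would give each candidate phrase a single salience score by which the phrases can be sorted.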


Learning to Rank Salient Phrases 2/3

Logistic Regression:

When the dependent variable y is dichotomous, logistic regression is more suitable, because what we want to predict is not a precise numerical value of the dependent variable but rather a probability q. Whereas q can only range from 0 to 1, logit(q) ranges from negative infinity to positive infinity.

$$\mathrm{logit}(q) = \log \frac{q}{1 - q} = b_0 + \sum_{j=1}^{p} b_j x_j + e$$
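The logit transform and its inverse can be written in a few lines. This sketch only illustrates the mapping between a probability q and the linear score; the coefficient values used in the example are hypothetical, and the residual term is ignored when predicting.

```python
import math

def logit(q):
    """log(q / (1 - q)) for 0 < q < 1; unbounded in both directions."""
    return math.log(q / (1.0 - q))

def sigmoid(z):
    """Inverse of logit: maps any real z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def salience_probability(b, x):
    """q such that logit(q) = b0 + sum_j b_j * x_j."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    return sigmoid(z)

# Hypothetical coefficients [b0, b1] and a single feature value.
print(salience_probability([0.0, 1.0], [0.0]))   # linear score 0 maps to 0.5
```

Because the output stays between 0 and 1, it can be read directly as the estimated probability that a phrase is salient and used for ranking.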


Learning to Rank Salient Phrases 3/3

Support Vector Regression:

The input x is first mapped onto a high-dimensional feature space using some nonlinear mapping.

An ε-insensitive loss function is used.

SV regression tries to minimize $\|\omega\|^2$.

*** Joachims T. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. Schölkopf B., Burges C. and Smola A. (eds.), MIT Press, 1999.
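For reference, the standard ε-insensitive loss is max(0, |y - f(x)| - ε): prediction errors smaller than ε are not penalized at all. The sketch below shows just this loss; the function name and the default ε = 0.1 are arbitrary choices for illustration.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """max(0, |y - f(x)| - eps): errors inside the eps-tube cost nothing."""
    return max(0.0, abs(y_true - y_pred) - eps)

# A prediction within eps of the target is free; a larger miss is penalized
# only by the amount that exceeds eps.
inside = eps_insensitive_loss(1.0, 0.95)
outside = eps_insensitive_loss(1.0, 0.5)
print(inside, outside)
```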


Experiments

The default number of results fetched from the search engines is set to 200.

Evaluation Measure:

Traditional clustering algorithms are difficult to evaluate. In this approach, evaluation is relatively easy because the problem is defined as a ranking problem, so classical evaluation methods from Information Retrieval can be used.

P@N: precision at the top N results.

R: set of top N salient keywords.

C: set of manually tagged correct salient keywords.
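With R and C defined this way, precision at N is commonly computed as P@N = |C ∩ R| / |R|. A minimal sketch under that assumption follows; the function name and the example phrases are invented, and the ranked list is assumed to contain no duplicates.

```python
def precision_at_n(ranked_phrases, correct, n):
    """P@N = |C ∩ R| / |R|, where R is the top-N slice of the ranking."""
    r = ranked_phrases[:n]
    if not r:
        return 0.0
    return sum(1 for p in r if p in correct) / len(r)

# Invented ranking and ground truth for the query "jaguar".
ranked = ["big cat", "car", "jaguar home", "mac os", "click here"]
correct = {"big cat", "car", "mac os"}

for n in (1, 3, 5):
    print(n, precision_at_n(ranked, correct, n))
```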


Experiments

Training Data Collection: Three human evaluators labeled ground-truth data for 30 queries, selected from one day’s query log from MSN.


Experiments - Training Data Collection:

For each query, extract all n-grams (n ≤ 3) from the search results as candidate phrases.
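Candidate extraction of this kind can be sketched as below. Lower-casing and whitespace tokenization are simplifying assumptions of this sketch, and the function name is hypothetical.

```python
def candidate_phrases(text, max_n=3):
    """Return every word n-gram of the text with 1 <= n <= max_n."""
    words = text.lower().split()
    phrases = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases

print(candidate_phrases("Jaguar big cats"))
# ['jaguar', 'big', 'cats', 'jaguar big', 'big cats', 'jaguar big cats']
```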

The three evaluators each selected from the candidates:

10 “good phrases” (score 100)
10 “medium phrases” (score 50)
All other phrases receive a score of zero.

Finally, the three evaluators’ scores are added together; phrases with a total score greater than 100 are assigned y = 1, and all others y = 0.
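The labeling rule amounts to summing the three evaluators' scores and thresholding at 100. A tiny sketch, with a hypothetical function name:

```python
def binary_label(scores):
    """scores: one score per evaluator (100 = good, 50 = medium, 0 = other).

    Returns y = 1 when the summed score is strictly greater than 100, else 0.
    """
    return 1 if sum(scores) > 100 else 0

print(binary_label([100, 50, 0]))   # 150 > 100, so y = 1
print(binary_label([100, 0, 0]))    # exactly 100 is not enough, so y = 0
```

Under this rule a phrase needs support from at least two evaluators to become a positive training example, since a single “good” vote only reaches 100.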


Experimental Results

Property Comparison:


Experimental Results

Learning methods comparison: three-fold cross validation is used to evaluate the three regression methods.


CONCLUSION AND FUTURE WORKS

Several properties, as well as several regression models, are proposed to calculate the salience score of phrases.

Clusters with short names are hopefully more readable and could improve users’ efficiency in browsing search results.

Future work:

Extract syntactic features for keywords and phrases to assist the salient phrase ranking.

A hierarchical structure of search results is necessary for more efficient browsing.

Some external taxonomies, such as Web directories, contain much knowledge, so a combination of classification and clustering might be helpful in this application.
