Query Suggestion
Naama Kraus
Slides are based on the papers:
Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering
Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications
The Problem
• User queries are an imperfect description of their information needs
• Examples:
– Ambiguous queries: jaguar
– General queries: haifa
– Terminology differences (synonyms) between user and corpus: stars - planets
Query Suggestions
Assist the user to phrase her information need
Example: Google Related Searches for the query jaguar
• Jaguar car
• Jaguar xf
• Jaguar animal
• Jaguar cat
Query suggestion algorithms
• Query suggestions are extracted from the query log
– There are methods that use different data sources, such as a corpus; these are not covered today
• Topic (cluster) based – identify groups of similar queries
• Sequence based – mine and analyze the query log for likely query sequences
Improving Search Engines by Query Clustering - Baeza-Yates et al.
• Algorithm outline
• Offline:
– Represent queries as term-weighted vectors
– Cluster queries
– Rank queries in each cluster
• Online:
– Given user’s query q
– Find cluster C containing q
– Suggest top k queries in cluster C
• Based on their rank and similarity to q
Query Model
• Given query q
• Let U be the set of URLs clicked for q (for all users and sessions)
– Information is extracted from the query log
• q’s term-weighted vector has a nonzero entry for any term that appears in some URL in U
• Terms are weighted according to
– Term frequency and URL popularity
– Formula in next slide …
Query Model (2)
Pop(u,q): the number of clicks on URL u for the query q
Note: the paper proposes a refinement to Pop(u,q) that is not biased by the search engine’s ranking
Query similarity is computed by some measure, e.g. cosine similarity.
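The query model can be sketched in a few lines. This is a minimal illustration, not the paper's exact formula: it assumes each term is weighted by Pop(u,q) times the term's frequency in u, normalized by the most frequent term in u, summed over clicked URLs. The names query_vector, cosine, clicks and url_terms are hypothetical.

```python
import math
from collections import Counter

def query_vector(clicks, url_terms):
    """Build a term-weighted vector for one query.

    clicks:    {url: Pop(u, q)}, clicks on each URL for this query
    url_terms: {url: list of terms in the document at that URL}
    Assumed weighting: sum over URLs of Pop(u, q) * tf(term, u) / max tf in u.
    """
    vec = {}
    for url, pop in clicks.items():
        tf = Counter(url_terms[url])
        if not tf:
            continue
        max_tf = max(tf.values())
        for term, freq in tf.items():
            vec[term] = vec.get(term, 0.0) + pop * freq / max_tf
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two queries that share no terms in their clicked documents get similarity 0, so they are unlikely to land in the same cluster.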
Query Support
• The fraction of the documents returned by the query that captured the attention of users (clicked documents)
• Denotes how ‘good’ a query is
– A ‘global score’
• Queries within a cluster are ranked according to their similarity to q as well as their support
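A small sketch of the online ranking step. The slides do not say how similarity and support are combined, so the weighted sum with parameter alpha below is an assumption, as are the names support and suggest.

```python
def support(num_clicked_docs, num_returned_docs):
    """Fraction of the documents returned for a query that users clicked."""
    if num_returned_docs == 0:
        return 0.0
    return num_clicked_docs / num_returned_docs

def suggest(q, cluster, similarity, supports, k=3, alpha=0.5):
    """Rank the other queries in q's cluster and return the top k.

    similarity: function (q, candidate) -> similarity score
    supports:   {query: support score}
    alpha:      assumed mixing weight between similarity and support
    """
    candidates = [c for c in cluster if c != q]

    def score(c):
        return alpha * similarity(q, c) + (1 - alpha) * supports[c]

    return sorted(candidates, key=score, reverse=True)[:k]
```

With alpha close to 1 the ranking follows similarity to q; with alpha close to 0 it follows the global support score.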
Query Flow Graph – Boldi et al.
• Main idea:
• Aggregate the (massive) raw data in the query log
– Many queries of many users
• Model user query behavior
• Use sophisticated techniques to infer query relatedness
Query Flow Graph Model
• G=(V, E, w) a directed graph where:
• V – nodes, representing the set of distinct queries Q
– Queries are extracted from the query log
• E – a set of directed edges
• Two queries q,q’ are connected with an edge if q’ follows q in at least one session
QFG Illustration
[Figure: a directed graph over query nodes q0 to q5. Nodes are queries; edges connect queries that follow one another. Example node labels: "apple ipod", "applestore".]
Weighting Function
• w : E -> (0..1] a weighting function that assigns a weight to every edge (q,q’)
• For each edge (q,q’) assign a probability that q’ follows q in the same session
– Extracted from the observed query log sessions
\[ w(q,q') = \frac{\mathrm{count}(q,q')}{d(q)}, \qquad d(q) = \sum_{q'} \mathrm{count}(q,q') \]
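A minimal sketch of building the weighted graph from sessions. It assumes "q’ follows q" means q’ is issued immediately after q in the same session, and that a session is simply an ordered list of query strings; build_qfg is a hypothetical name.

```python
from collections import defaultdict

def build_qfg(sessions):
    """Return {q: {q_next: w(q, q_next)}} from a list of sessions."""
    # count[q][q_next] = number of times q_next directly follows q
    count = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            count[q][q_next] += 1

    weights = {}
    for q, followers in count.items():
        d_q = sum(followers.values())  # d(q): total outgoing count of q
        weights[q] = {q_next: c / d_q for q_next, c in followers.items()}
    return weights
```

By construction the outgoing weights of every node with at least one follower sum to 1.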
Illustration
[Figure: the graph over q0 to q5 with weighted edges. Edge weights shown: 0.5, 0.25, 0.25; 0.1, 0.55, 0.35; 0.2, 0.8; 1.0; 1.0.]
Random walk on the QFG
• A random surfer executes a random walk on the graph as follows:
– Start at some node
– Move along an edge with probability d
• Choose an edge by its probability (weight)
– Or teleport to a random node with probability 1-d
• Choose the node uniformly
The stationary distribution
• The probability of being at node q in the limit of an infinitely long walk
• Random walk score vector – query absolute scores
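The stationary distribution can be approximated by power iteration. The sketch below makes two assumptions the slides do not state: d defaults to 0.85, and a node with no outgoing edges always teleports. The name stationary_distribution is hypothetical.

```python
def stationary_distribution(weights, d=0.85, iterations=100):
    """Approximate the absolute random walk score of every query.

    weights: {q: {q_next: w(q, q_next)}}, outgoing weights summing to 1
    """
    nodes = sorted(set(weights) | {t for out in weights.values() for t in out})
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            out = weights.get(v, {})
            if out:
                # Follow an edge with probability d, chosen by its weight.
                for t, w in out.items():
                    nxt[t] += d * score[v] * w
                jump = (1 - d) * score[v]
            else:
                # Assumption: a node without outgoing edges always teleports.
                jump = score[v]
            # Teleport to a node chosen uniformly.
            for t in nodes:
                nxt[t] += jump / n
        score = nxt
    return score
```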
Random Walk Relative to a Node
• Random walk with restart to a single node:
– Start at node q
– Instead of teleporting to any node, always teleport to q
• The score of node q’ for this random walk measures relatedness of q’ to q
– The probability of getting from q to q’ in the limit of an infinitely long walk
– Can normalize a node’s relative score by its absolute score; somewhat similar to tf-idf – avoids highly popular queries (unrelated to q)
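A sketch of the relative walk and the normalization step. The same routine is run twice: once teleporting uniformly (absolute scores) and once teleporting only to q (relative scores). Dividing the relative score by the absolute score is one plain reading of "normalize"; the exact form, the default d, and the names walk and related_queries are assumptions.

```python
def walk(weights, teleport, d=0.85, iterations=100):
    """Random walk scores for a given teleport distribution {node: prob}."""
    nodes = list(teleport)
    score = dict(teleport)  # start the walk from the teleport distribution
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            out = weights.get(v, {})
            follow = d if out else 0.0  # nodes without out-edges always teleport
            for t, w in out.items():
                nxt[t] += follow * score[v] * w
            for t, p in teleport.items():
                nxt[t] += (1 - follow) * score[v] * p
        score = nxt
    return score

def related_queries(weights, q, d=0.85):
    """Rank all other queries by relative score divided by absolute score."""
    nodes = set(weights) | {t for out in weights.values() for t in out}
    uniform = {v: 1.0 / len(nodes) for v in nodes}
    restart = {v: 1.0 if v == q else 0.0 for v in nodes}
    absolute = walk(weights, uniform, d)
    relative = walk(weights, restart, d)
    ratio = {v: relative[v] / absolute[v] for v in nodes if v != q}
    return sorted(ratio, key=ratio.get, reverse=True)
```

Queries that cannot be reached from q get a relative score of 0 and fall to the bottom of the ranking, however popular they are overall.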
The Full Picture
• Off-line stage
– For each node q in the graph
• Compute the stationary distribution vector of q
– A random walk score relative to q
• Store suggestions for q, alternatives:
– top k scored nodes
– nodes having a score above some threshold
• On-line stage
– User submits query q
– Suggest queries stored for q
• Queries most related to q
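The two storage alternatives and the online lookup can be sketched as follows. It assumes the relative scores for each query are already available as a dict; the function names and the table layout are hypothetical.

```python
import heapq

def top_k(scores, q, k):
    """Alternative 1: keep the k highest-scored nodes other than q."""
    candidates = {c: s for c, s in scores.items() if c != q}
    return heapq.nlargest(k, candidates, key=candidates.get)

def above_threshold(scores, q, threshold):
    """Alternative 2: keep every node scoring above the threshold."""
    picked = [c for c, s in scores.items() if c != q and s > threshold]
    return sorted(picked, key=scores.get, reverse=True)

def build_table(relative_scores, k):
    """Off-line: relative_scores is {q: {q_other: score relative to q}}."""
    return {q: top_k(scores, q, k) for q, scores in relative_scores.items()}

def suggest_online(table, q):
    """On-line: look up the stored suggestions; empty list for unseen queries."""
    return table.get(q, [])
```

Top k gives every query the same number of suggestions; a threshold gives fewer suggestions to queries with weak neighbours.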