query log analysis


Upload: tamah

Post on 22-Feb-2016



TRANSCRIPT

Page 1: Query Log Analysis

Query Log Analysis

Naama Kraus

Slides are based on the papers:
• Andrei Broder, A taxonomy of web search
• Ricardo Baeza-Yates, Graphs from Search Engine Queries
• Hassan, Jones, Klinkner, Beyond DCG: User Behavior as a Predictor of a Successful Search

Page 2: Query Log Analysis

A Taxonomy of Web Searches

• [Andrei Broder] classifies web queries according to their intent:
– Navigational - reach a particular site
  • Example: cnn, Oracle
– Informational - acquire some information
  • Example: the history of haifa, information retrieval
– Transactional - perform some web-mediated activity. Further interaction is expected.
  • E.g. shopping, downloading files, accessing databases
  • Example: new balance shoes, Israel flights

Page 3: Query Log Analysis

Query Log

• A search engine query log records users’ searches
• A typical record contains
– Anonymous user id u
– Search query q
– Returned documents V
– Clicked documents C
– Timestamp t
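The record fields listed above can be modeled as a small data structure. This is only a sketch: the class name, field names, and sample URLs are assumptions for illustration and do not come from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryLogRecord:
    """One query-log entry (hypothetical layout, for illustration only)."""
    user_id: str               # anonymous user id u
    query: str                 # search query q
    returned: tuple[str, ...]  # returned documents V (as URLs)
    clicked: tuple[str, ...]   # clicked documents C
    timestamp: datetime        # timestamp t


# Made-up sample record; the URLs are placeholders.
record = QueryLogRecord(
    user_id="1234",
    query="apple ipod",
    returned=("https://a.example/ipod", "https://b.example/apple"),
    clicked=("https://a.example/ipod",),
    timestamp=datetime(2016, 2, 22, 12, 5),
)
```

In this sketch the clicked documents are expected to be a subset of the returned documents, which is the natural reading of C and V.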

Page 4: Query Log Analysis

Query Log Example

1234, apple, 12:04
1234, apple ipod, 12:05
1234, ynet, 12:13
145, google, 12:20
145, eBay, 12:56
32, ynet news, 12:59
145, Solaris systen, 13:01
145, Solaris system, 13:05
…

Page 5: Query Log Analysis

Session

• A sequence of searches of one particular user u within a specific time limit
• S = < <u, q_1, t_1>, …, <u, q_k, t_k> >
• t_1 < … < t_k (=> ordered sequence)
• t_{i+1} - t_i < t_0 (=> t_0 is a timeout threshold)
• Note 1: a session may contain unrelated queries
• Note 2: identifying sessions is easy

Page 6: Query Log Analysis

Session Example

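• 1234, apple, 12:04
• 1234, apple ipod, 12:05
• 1234, ynet, 12:13
• 1234, apple store, 12:20
• 1234, cnn news, 12:56
• 1234, cnn webcast, 12:59
• 1234, apple apps, 13:01

• Timeout threshold = 30 minutes
• Session 1: the queries from 12:04 to 12:20
• Session 2: the queries from 12:56 to 13:01 (the 36-minute gap after 12:20 exceeds the threshold)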
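The session definition (same user, ordered by time, consecutive gap below the timeout) can be sketched as a single pass over the sorted log. The function name `split_sessions` and the tuple layout are assumptions for illustration.

```python
from datetime import datetime, timedelta


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Split (user_id, query, timestamp) tuples into sessions.

    A new session starts when the user changes or when the gap to the
    previous query of the same user is not below the timeout.
    """
    sessions = []
    prev_user, prev_ts = None, None
    for user, query, ts in sorted(records, key=lambda r: (r[0], r[2])):
        if user != prev_user or ts - prev_ts >= timeout:
            sessions.append([])
        sessions[-1].append((user, query, ts))
        prev_user, prev_ts = user, ts
    return sessions


def at(hhmm):
    """Parse an 'HH:MM' string (the date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


log = [
    ("1234", "apple", at("12:04")),
    ("1234", "apple ipod", at("12:05")),
    ("1234", "ynet", at("12:13")),
    ("1234", "apple store", at("12:20")),
    ("1234", "cnn news", at("12:56")),
    ("1234", "cnn webcast", at("12:59")),
    ("1234", "apple apps", at("13:01")),
]

sessions = split_sessions(log)
for session in sessions:
    print([query for _, query, _ in session])
# ['apple', 'apple ipod', 'ynet', 'apple store']
# ['cnn news', 'cnn webcast', 'apple apps']
```

Only the gap between consecutive queries matters, so a session can run much longer than the timeout as long as the user keeps searching.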

Page 7: Query Log Analysis

Query Chain

• A sequence of queries with a similar information need of a particular user
– Also known as mission or logical session
• Example: haifa maps, haifa travel, attractions in haifa
• Note 1: a chain contains related queries only
• Note 2: identifying chains is difficult

Page 8: Query Log Analysis

Query Chain Example

• 1234, apple, 12:04
• 1234, apple ipod, 12:05
• 1234, ynet, 12:13
• 1234, apple store, 12:20
• 1234, cnn news, 12:56
• 1234, cnn webcast, 12:59
• 1234, apple apps, 13:01

• chain1
• chain2

Page 9: Query Log Analysis

Click Graph

Bipartite graph
Nodes on the left side are unique queries
Nodes on the right side are unique URLs

An edge exists between q and u if the log contains a click on u for query q

Edges may be weighted according to the number of clicks

This graph is used by numerous algorithms for various purposes
E.g., query and URL clustering, query recommendations …
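A click graph of this kind can be built from (query, URL) click pairs with a counter. This is a minimal sketch; the function name and the sample clicks are made up for illustration.

```python
from collections import Counter, defaultdict


def build_click_graph(clicks):
    """Build a weighted bipartite click graph.

    clicks: iterable of (query, url) pairs, one pair per logged click.
    Returns the edge weights plus adjacency sets for both sides.
    """
    edge_weight = Counter(clicks)        # (query, url) -> number of clicks
    urls_by_query = defaultdict(set)     # left side -> right-side neighbours
    queries_by_url = defaultdict(set)    # right side -> left-side neighbours
    for query, url in edge_weight:
        urls_by_query[query].add(url)
        queries_by_url[url].add(query)
    return edge_weight, urls_by_query, queries_by_url


# Hypothetical clicks, not taken from a real log.
clicks = [
    ("apple", "apple.example"),
    ("apple", "apple.example"),
    ("apple ipod", "apple.example"),
    ("cnn", "cnn.example"),
]
edge_weight, urls_by_query, queries_by_url = build_click_graph(clicks)
print(edge_weight[("apple", "apple.example")])  # 2
```

Queries that share a URL neighbour (here "apple" and "apple ipod") are natural candidates for clustering or recommendation.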

Page 10: Query Log Analysis

Query Graphs

Each unique query is a node in the graph

Next slides – connection types between queries (edges)

Proposed by [Ricardo Baeza-Yates]

Page 11: Query Log Analysis

Query Graphs – Word Graph

An edge between nodes exists if the queries share common terms

Possible node weight – number of occurrences in the log

Possible edge weight – Jaccard distance

paris hotels

cheap paris hotels

paris attractions

london attractions
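The word graph over the four example queries can be sketched as follows. Splitting queries on whitespace is an assumption, and the Jaccard distance is computed as 1 minus the Jaccard similarity of the two term sets.

```python
from itertools import combinations


def jaccard_distance(q1, q2):
    """1 - |A ∩ B| / |A ∪ B| over the whitespace-separated terms.

    Assumes both queries are non-empty.
    """
    a, b = set(q1.split()), set(q2.split())
    return 1.0 - len(a & b) / len(a | b)


def build_word_graph(queries):
    """Return {(q1, q2): distance} for every pair sharing at least one term."""
    edges = {}
    for q1, q2 in combinations(queries, 2):
        if set(q1.split()) & set(q2.split()):
            edges[(q1, q2)] = jaccard_distance(q1, q2)
    return edges


queries = [
    "paris hotels",
    "cheap paris hotels",
    "paris attractions",
    "london attractions",
]
edges = build_word_graph(queries)
for pair, distance in edges.items():
    print(pair, round(distance, 2))
```

With these queries, "paris hotels" and "london attractions" share no term and therefore get no edge, while "paris hotels" and "cheap paris hotels" are the closest pair (distance 1/3).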

Page 12: Query Log Analysis

Query Graphs – Session Graph

The weight of node q is the number of sessions that contain the query q (usually equal to the number of query occurrences)

A directed edge goes from q1 to q2 if q1 occurred before q2 in the same session

The edge’s weight is the number of such occurrences

paris hotels

paris attractions

cheap paris hotels

london attractions
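A session graph can be sketched by counting ordered query pairs inside each session. Counting every ordered pair, not only adjacent queries, is one reading of "occurred before"; the sessions below are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def build_session_graph(sessions):
    """sessions: list of query lists, each in chronological order.

    Node weight: number of sessions containing the query.
    Edge weight: number of times q1 appeared before q2 in the same session.
    """
    node_weight = Counter()
    edge_weight = Counter()
    for session in sessions:
        node_weight.update(set(session))
        # combinations() keeps list order, so q1 always precedes q2.
        for q1, q2 in combinations(session, 2):
            if q1 != q2:
                edge_weight[(q1, q2)] += 1
    return node_weight, edge_weight


# Hypothetical sessions.
sessions = [
    ["paris hotels", "cheap paris hotels", "paris attractions"],
    ["paris hotels", "paris attractions"],
    ["london attractions"],
]
node_weight, edge_weight = build_session_graph(sessions)
print(edge_weight[("paris hotels", "paris attractions")])  # 2
```

The graph is directed: the reverse edge from "paris attractions" to "paris hotels" has weight 0 in this sample.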

Page 13: Query Log Analysis

Query Graphs – URL Cover Graph

paris hotels

paris attractions

cheap paris hotels

london attractions

An edge exists between q1 and q2 if they share clicked URLs

Node weight = #occurrences

The edge’s weight is the number of common clicks
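A URL cover graph can be sketched by intersecting the clicked-URL sets of each query pair. Interpreting "number of common clicks" as the number of distinct shared URLs is an assumption, and the URLs are placeholders.

```python
from itertools import combinations


def build_url_cover_graph(clicked_urls):
    """clicked_urls: dict mapping each query to its set of clicked URLs.

    Returns {(q1, q2): number of shared clicked URLs} for pairs that share any.
    """
    edges = {}
    for q1, q2 in combinations(sorted(clicked_urls), 2):
        common = clicked_urls[q1] & clicked_urls[q2]
        if common:
            edges[(q1, q2)] = len(common)
    return edges


# Hypothetical click data.
clicked_urls = {
    "paris hotels": {"hotels.example/paris", "travel.example/paris"},
    "cheap paris hotels": {"hotels.example/paris", "deals.example/paris"},
    "paris attractions": {"travel.example/paris"},
    "london attractions": {"travel.example/london"},
}
cover_edges = build_url_cover_graph(clicked_urls)
print(cover_edges)
```

Unlike the word graph, two queries with no terms in common can still be connected here if users clicked the same page for both.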

Page 14: Query Log Analysis

Query Graph – URL Link Graph

paris hotels

paris attractions

cheap paris hotels

london attractions

An edge exists between q1 and q2 if there is at least one link between a clicked URL of q1 and a clicked URL of q2

Node weight = #occurrences

The edge’s weight is the number of such links

Page 15: Query Log Analysis

Query Graph – URL Terms Graph

paris hotels

paris attractions

cheap paris hotels

london attractions

Represent a clicked URL by a set of terms (whole page, snippet, anchors, title, a combination …)

Weight terms by their frequencies

Node weight = #occurrences

There’s an edge between q1 and q2 if there are at least m common terms in at least one clicked URL of q1 and one clicked URL of q2

Edge weight is the sum of the frequencies of the common terms

Page 16: Query Log Analysis

User Behavior as a Predictor of a Successful Search

• Goal: given a sequence of user actions within a specific logical session, predict whether the search goal ended successfully or not
– Success – user is satisfied with the results
– Failure – user is unsatisfied
• Method:
– Analyze the query log and learn success/failure patterns
– Use learned models for prediction
• Proposed by [Hassan, Jones and Klinkner]

Page 17: Query Log Analysis

Data

• A rich query log of queries and user actions:
– Query (Q)
– Search Click (SR)
– Sponsored Search Click (AD)
– Related Search Click (RL)
  • Query recommendations
– Spelling Suggestion Click (SP)
– Shortcut Click (SC)
  • E.g. image, video, news …
– Any Other Click (OTH)
  • E.g. browser tab

Page 18: Query Log Analysis

Data Labeling

• Random sample of user sessions

• Human editors labeled data:
– Detected logical sessions
– Success/Failure
  • definitely successful, probably successful, unsure, probably unsuccessful, and definitely unsuccessful

Page 19: Query Log Analysis

Markov Models

• Partition training data into two splits
– successful goals
– unsuccessful goals
• For each group construct a Markov Model derived from seen action sequences
– A Model describes the user behavior in case of a successful/unsuccessful search goal
– Action type is a state
– Weight a transition from one state to another according to its probability as observed in the data (MLE)

Page 20: Query Log Analysis

Transition Weighting - MLE

$$\Pr_{MLE}(S_i, S_j) = \frac{N_{S_i,S_j}}{N_{S_i}}$$

$N_{S_i,S_j}$: number of times we saw a transition from $S_i$ to $S_j$

$N_{S_i}$: number of times we saw a transition from $S_i$ (to any state)
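The maximum-likelihood weighting on this slide divides the count of transitions from one state to another by the count of transitions leaving the source state. A minimal sketch follows; wrapping each sequence in START and END states mirrors the illustration slide, and the training sequences are made up.

```python
from collections import Counter


def estimate_transitions(sequences):
    """Estimate Pr(S_i -> S_j) = N(S_i, S_j) / N(S_i) from action sequences.

    sequences: list of action-type lists, e.g. ["Q", "SR"].
    Each sequence is wrapped in START and END before counting.
    """
    pair_counts = Counter()   # N(S_i, S_j)
    state_counts = Counter()  # N(S_i), counted as transitions leaving S_i
    for seq in sequences:
        path = ["START", *seq, "END"]
        for src, dst in zip(path, path[1:]):
            pair_counts[(src, dst)] += 1
            state_counts[src] += 1
    return {
        pair: count / state_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical training sequences for one of the two groups.
training = [
    ["Q", "SR"],
    ["Q", "SR", "Q", "SR"],
    ["Q", "AD"],
]
probs = estimate_transitions(training)
print(probs[("Q", "SR")])  # 0.75
```

Running the same function separately on the successful and the unsuccessful goals yields the two models used for prediction.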

Page 21: Query Log Analysis

Illustration

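[Figure: example Markov model with states START, Q, SR, AD, RL and END; edges are labeled with transition probabilities (values shown: 1, 0.3, 0.1, 0.6, 0.1, 0.4, 0.5, 1, 1)]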

Page 22: Query Log Analysis

Prediction (1)

• Given a user’s action sequence, we need to predict whether it is successful or not
• We’ve learned two models, Ms and Mf, of successful and unsuccessful patterns
• Compute the probability that a given sequence S = {S1, …, Sn} was generated from Ms; do the same for Mf
• Predict success/non-success by computing the log likelihood
– Formulas in next slide

Page 23: Query Log Analysis

Prediction (2)

Formulas taken from the paper
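The paper's exact formulas are not reproduced in this transcript. As a hedged sketch, the log likelihood of a sequence under a first-order Markov model is the sum of the logs of its transition probabilities, and the two models can be compared directly. The decision rule (pick the model with the higher log likelihood), the floor for unseen transitions, and all probabilities below are assumptions for illustration.

```python
import math


def log_likelihood(model, actions, floor=1e-6):
    """Sum of log transition probabilities along START -> actions -> END.

    model: dict mapping (source, target) to a transition probability.
    Unseen transitions get a small floor probability (an assumption).
    """
    path = ["START", *actions, "END"]
    return sum(math.log(model.get(pair, floor)) for pair in zip(path, path[1:]))


def predict_success(m_success, m_failure, actions):
    """Assumed rule: predict success if Ms explains the sequence better than Mf."""
    return log_likelihood(m_success, actions) > log_likelihood(m_failure, actions)


# Hypothetical transition probabilities for Ms and Mf.
m_s = {("START", "Q"): 1.0, ("Q", "SR"): 0.8, ("Q", "Q"): 0.1,
       ("Q", "END"): 0.1, ("SR", "END"): 0.7, ("SR", "Q"): 0.3}
m_f = {("START", "Q"): 1.0, ("Q", "SR"): 0.3, ("Q", "Q"): 0.4,
       ("Q", "END"): 0.3, ("SR", "END"): 0.2, ("SR", "Q"): 0.8}

print(predict_success(m_s, m_f, ["Q", "SR"]))      # True
print(predict_success(m_s, m_f, ["Q", "Q", "Q"]))  # False
```

In this toy setup a query followed by a result click looks successful, while repeated reformulation without any click looks unsuccessful.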