Dragon Star Program Course (龙星计划课程): Information Retrieval (信息检索)
Personalized Search & User Modeling
ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois at Urbana-Champaign
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 1
Dragon Star Program Course (龙星计划课程): Information Retrieval
Personalized Search & User Modeling
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, and Statistics
University of Illinois at Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 2
What is Personalized Search?
• Use more user information than the user's query in retrieval
  – "more information" = the user's interaction history → implicit feedback
  – "more information" = the user's judgments or answers to clarification questions → explicit feedback
• Personalization can be done in multiple ways:
  – Personalize the collection
  – Personalize ranking
  – Personalize result presentation
  – …
• Personalized search = user modeling + model exploitation
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 3
Why Personalized Search?
• The more we know about the user's information need, the more likely we are to retrieve relevant documents, so we should learn as much as we can about the user
• Personalized search can be especially helpful when the query alone doesn't work well
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 4
Client-Side vs. Server-Side Personalization
• Server-side (most work, including commercial products):
  – Sees global information (all documents, all users)
  – Limited user information (can't see activities outside the search results)
  – Privacy issue
• Client-side (UCAIR):
  – More information about the user, thus more accurate user modeling (complete interaction history + other user activities)
  – More scalable ("distributed personalization")
  – Alleviates the privacy problem
• Combination of server-side and client-side? How?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 5
Outline
• A framework for optimal interactive retrieval
• Implicit feedback (no user effort)
  – Within a search session
  – For improving result organization
• Explicit feedback (with user effort)
  – Term feedback
  – Active feedback
• Improving search result organization
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 6
1. A Framework for Optimal Interactive Retrieval [Shen et al. 05]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 7
IR as Sequential Decision Making
(User: information need; System: model of the information need)

• User action A1: enter a query
  → System decides: which documents to present? how to present them?
  → System response Ri: results (i = 1, 2, 3, …)
• User decides which documents to view
• User action A2: view a document
  → System decides: which part of the document to show? how?
  → System response R': document content
• User decides whether to view more
• User action A3: click on the "Back" button
  → …
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 8
Retrieval Decisions

User U actions:      A1   A2   …   At-1   At
System responses:    R1   R2   …   Rt-1   Rt = ?
History: H = {(Ai, Ri)}, i = 1, …, t-1
Document collection: C

Given U, C, At, and H, choose the best Rt from all possible responses r(At) to At.

Examples:
• At = query "Jaguar": r(At) = all possible rankings of C; the best Rt = the best ranking for the query
• At = click on the "Next" button: r(At) = all possible rankings of the unseen docs; the best Rt = the best ranking of the unseen docs
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 9
A Risk Minimization Framework

Observed: user U, interaction history H, current user action At, document collection C
Inferred: user model M = (θU, S, …) (information need θU, seen documents S, …)
All possible responses: r(At) = {r1, …, rn}
Loss function: L(ri, At, M)
Optimal response r*: the response with minimum expected loss (Bayes risk):

  $R_t = \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM$
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 10
A Simplified Two-Step Decision-Making Procedure
• Approximate the Bayes risk by the loss at the mode of the posterior distribution
• Two-step procedure (a small code sketch follows below):
  – Step 1: Compute an updated user model M* based on the currently available information
  – Step 2: Given M*, choose a response to minimize the loss function

  $R_t = \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM$
  $\;\;\approx \arg\min_{r \in r(A_t)} L(r, A_t, M^*)$
  where $M^* = \arg\max_M P(M \mid U, H, A_t, C)$
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 11
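The two-step procedure can be made concrete with a small sketch. This is an illustrative toy implementation, not the UCAIR code: the function name, the toy models, and the posterior and loss callables are hypothetical stand-ins for whatever model space and loss function a real system would use.

```python
def two_step_decision(responses, candidate_models, posterior, loss):
    """responses: candidate system responses r in r(A_t)
    candidate_models: candidate user models M
    posterior(M): P(M | U, H, A_t, C) (unnormalized is fine)
    loss(r, M): L(r, A_t, M)"""
    # Step 1: M* = argmax_M P(M | U, H, A_t, C)
    m_star = max(candidate_models, key=posterior)
    # Step 2: R_t = argmin_{r in r(A_t)} L(r, A_t, M*)
    return min(responses, key=lambda r: loss(r, m_star)), m_star

if __name__ == "__main__":
    # Toy example: models are word distributions, responses are rankings.
    models = [{"jaguar": 0.5, "car": 0.5}, {"jaguar": 0.5, "cat": 0.5}]
    posterior = lambda m: 0.8 if "car" in m else 0.2   # pretend the history favors "car"
    docs = ["doc_about_cars", "doc_about_cats"]
    rankings = [tuple(docs), tuple(reversed(docs))]
    loss = lambda r, m: 0.0 if ("car" in m) == ("cars" in r[0]) else 1.0
    best_ranking, m_star = two_step_decision(rankings, models, posterior, loss)
    print(best_ranking, m_star)
```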
Optimal Interactive Retrieval

User issues action A1 (given information need U and collection C)
  → System infers M*1 = argmax P(M1 | U, H, A1, C)
  → System returns the response R1 minimizing L(r, A1, M*1)
User issues action A2
  → System infers M*2 = argmax P(M2 | U, H, A2, C)
  → System returns the response R2 minimizing L(r, A2, M*2)
User issues action A3, and so on: the user and the IR system alternate over the collection.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 12
Refinement of Risk Minimization
• r(At): decision space (depends on At)
  – r(At) = all possible subsets of C (document selection)
  – r(At) = all possible rankings of docs in C
  – r(At) = all possible rankings of unseen docs
  – r(At) = all possible subsets of C + summarization strategies
• M: user model
  – Essential component: θU = user information need
  – S = seen documents
  – n = "topic is new to the user"
• L(Rt, At, M): loss function
  – Generally measures the utility of Rt for a user modeled as M
  – Often encodes retrieval criteria (e.g., using θU to select a ranking of docs)
• P(M | U, H, At, C): user model inference
  – Often involves estimating a unigram language model θU
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 13
Case 1: Context-Insensitive IR
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– p(M | U, H, At, C) = p(θU | Q)

  $L(r, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\text{viewed} \mid d_i)\, D(\theta_U \| \theta_{d_i})$

Since $p(\text{viewed} \mid d_1) \ge p(\text{viewed} \mid d_2) \ge \ldots$, the optimal ranking Rt is obtained by ranking documents by $D(\theta_U \| \theta_{d_i})$ (a small code sketch follows below).
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 14
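Since Case 1 reduces to ranking documents by D(θU || θd), a minimal sketch looks like the following. It assumes Dirichlet prior smoothing for the document models (a common choice, not mandated by the slides); the corpus, the query model, and the μ value are toy examples.

```python
import math
from collections import Counter

def dirichlet_doc_model(doc_tokens, collection_model, mu=2000):
    """p(w | theta_d) with Dirichlet prior smoothing from the collection model."""
    counts, dlen = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (counts[w] + mu * collection_model(w)) / (dlen + mu)

def kl_score(query_model, doc_model):
    """Negative KL divergence -D(theta_U || theta_d); higher is better.
    The query-entropy term is constant across documents and omitted."""
    return sum(p * math.log(doc_model(w)) for w, p in query_model.items())

if __name__ == "__main__":
    docs = {"d1": "jaguar car racing speed".split(),
            "d2": "jaguar animal jungle cat".split()}
    all_tokens = [w for toks in docs.values() for w in toks]
    coll = Counter(all_tokens)
    coll_model = lambda w: coll[w] / len(all_tokens)
    query_model = {"jaguar": 0.5, "car": 0.5}            # toy theta_U
    ranked = sorted(docs, key=lambda d: kl_score(
        query_model, dirichlet_doc_model(docs[d], coll_model)), reverse=True)
    print(ranked)   # d1 (the car sense) should score at least as high as d2
```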
Case 2: Implicit Feedback
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

  $L(r, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\text{viewed} \mid d_i)\, D(\theta_U \| \theta_{d_i})$

As in Case 1, the optimal ranking Rt is obtained by ranking documents by $D(\theta_U \| \theta_{d_i})$; the difference is that θU is now estimated from both Q and the history H.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 15
Case 3: General Implicit Feedback
– At = "enter a query Q", or click the "Back"/"Next" button
– r(At) = all possible rankings of the unseen docs in C
– M = (θU, S), where S = seen documents
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

  $L(r, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\text{viewed} \mid d_i)\, D(\theta_U \| \theta_{d_i})$

The optimal ranking Rt is obtained by ranking the unseen documents by $D(\theta_U \| \theta_{d_i})$.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 16
Case 4: User-Specific Result Summary
– At = "enter a query Q"
– r(At) = {(D, η)}, where D ⊆ C, |D| = k, and η ∈ {"snippet", "overview"}
– M = (θU, n), where n ∈ {0, 1} indicates "topic is new to the user"
– p(M | U, H, At, C) = p(θU, n | Q, H), with posterior mode M* = (θ*, n*)

  $L(r, A_t, M) = L(D, \eta, \theta^*, n^*) = L(D, \theta^*) + L(\eta, n^*) = \sum_{d_i \in D} D(\theta^* \| \theta_{d_i}) + L(\eta, n^*)$

  L(η, n*):               n* = 1    n* = 0
  η = snippet               1         0
  η = overview              0         1

Decision: choose the k most relevant docs; if the topic is new to the user (n* = 1), give an overview summary, otherwise a regular snippet summary.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 17
What You Should Know
• Advantages and disadvantages of client-side vs. server-side personalization
• The optimal interactive retrieval framework provides a general way to model personalized search
  – Maximum user modeling
  – Immediate benefit ("eager feedback")
• Personalization can potentially be done for all the components and steps in a retrieval system
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 18
2. Implicit Feedback [Shen et al. 05, Tan et al. 06]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 19
"Jaguar" Example

[Figure: top results for the query "Jaguar": four pages about the car, one about the software, one about the animal]

Suppose we know:
1. The previous query was "racing cars" vs. "Apple OS"
2. "car" occurs far more frequently than "Apple" in pages browsed by the user in the last 20 days
3. The user just viewed an "Apple OS" document
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 20
How can we exploit such implicit feedback information that already naturally exists to improve ranking
accuracy?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 21
Risk Minimization for Implicit Feedback
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

  $L(r, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\text{viewed} \mid d_i)\, D(\theta_U \| \theta_{d_i})$

Since $p(\text{viewed} \mid d_1) \ge p(\text{viewed} \mid d_2) \ge \ldots$, the optimal ranking Rt is obtained by ranking documents by $D(\theta_U \| \theta_{d_i})$.

We therefore need to estimate a context-sensitive language model θU.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 22
Scenario 1: Use Information in One Session [Shen et al. 05]

User queries:        Q1 (e.g., "Apple software"), Q2, …, Qk (e.g., "Jaguar")
User clickthrough:   C1 = {C1,1, C1,2, C1,3, …}, C2 = {C2,1, C2,2, C2,3, …}, …
                     (e.g., "Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, …")

User model for the current query: combine the query history and the clickthrough history with Qk:

  $p(w \mid \theta_k) = p(w \mid Q_k, Q_1, \ldots, Q_{k-1}, C_1, \ldots, C_{k-1}) = \;?$
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 23
Method 1: Fixed Coefficient Interpolation (FixInt)

Average the user's query history and clickthrough history:

  $p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i)$
  $p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)$

Linearly interpolate the two history models:

  $p(w \mid H) = \beta\, p(w \mid H_C) + (1 - \beta)\, p(w \mid H_Q)$

Linearly interpolate the current query and the history model (see the sketch below):

  $p(w \mid \theta_k) = \alpha\, p(w \mid Q_k) + (1 - \alpha)\, p(w \mid H)$
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 24
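A minimal sketch of FixInt, assuming simple maximum-likelihood models for queries and clicked snippets. The helper names lm and mix are made up for this example; the α = 0.1, β = 1.0 defaults follow the settings reported in the results later in this section.

```python
from collections import Counter

def lm(text):
    """Maximum-likelihood unigram model of a piece of text."""
    c = Counter(text.split())
    n = sum(c.values())
    return {w: cnt / n for w, cnt in c.items()}

def mix(models, weights):
    """Weighted mixture of unigram models (dicts of word -> probability)."""
    out = {}
    for m, a in zip(models, weights):
        for w, p in m.items():
            out[w] = out.get(w, 0.0) + a * p
    return out

def fixint(current_query, past_queries, past_clicks, alpha=0.1, beta=1.0):
    """FixInt: theta_k = alpha*p(w|Q_k) + (1-alpha)*[beta*p(w|H_C) + (1-beta)*p(w|H_Q)]."""
    h_q = mix([lm(q) for q in past_queries], [1 / len(past_queries)] * len(past_queries))
    h_c = mix([lm(c) for c in past_clicks], [1 / len(past_clicks)] * len(past_clicks))
    history = mix([h_c, h_q], [beta, 1 - beta])
    return mix([lm(current_query), history], [alpha, 1 - alpha])

if __name__ == "__main__":
    theta = fixint("jaguar", ["apple software"],
                   ["apple mac os x product page"], alpha=0.1, beta=1.0)
    print(sorted(theta.items(), key=lambda x: -x[1])[:5])
```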
Method 2: Bayesian Interpolation (BayesInt)

Average the user's query history and clickthrough history as before:

  $p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i), \quad p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)$

Use the history models as a Dirichlet prior on the current query Qk (see the sketch below):

  $p(w \mid \theta_k) = \frac{c(w, Q_k) + \mu\, p(w \mid H_Q) + \nu\, p(w \mid H_C)}{|Q_k| + \mu + \nu}$

Intuition: trust the current query Qk more if it is longer.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 25
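A corresponding sketch of the BayesInt update, assuming the history models p(w|H_Q) and p(w|H_C) have already been averaged as above; the example distributions are toy values.

```python
from collections import Counter

def bayes_int(current_query, hist_query_model, hist_click_model, mu=0.2, nu=5.0):
    """BayesInt: p(w|theta_k) = [c(w,Q_k) + mu*p(w|H_Q) + nu*p(w|H_C)] / (|Q_k| + mu + nu)."""
    counts = Counter(current_query.split())
    qlen = sum(counts.values())
    vocab = set(counts) | set(hist_query_model) | set(hist_click_model)
    denom = qlen + mu + nu
    return {w: (counts[w] + mu * hist_query_model.get(w, 0.0)
                + nu * hist_click_model.get(w, 0.0)) / denom
            for w in vocab}

if __name__ == "__main__":
    h_q = {"apple": 0.5, "software": 0.5}           # averaged past-query model (toy)
    h_c = {"apple": 0.4, "mac": 0.3, "os": 0.3}     # averaged clickthrough model (toy)
    theta = bayes_int("jaguar", h_q, h_c, mu=0.2, nu=5.0)
    print(sorted(theta.items(), key=lambda x: -x[1])[:5])
```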
Method 3: Online Bayesian Updating (OnlineUp)

Intuition: incrementally update the language model as each query Qi and clickthrough Ci arrives:

  $p(w \mid \theta_i) = \frac{c(w, Q_i) + \mu\, p(w \mid \theta'_{i-1})}{|Q_i| + \mu}$

  $p(w \mid \theta'_i) = \frac{c(w, C_i) + \nu\, p(w \mid \theta_i)}{|C_i| + \nu}$
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 26
Method 4: Batch Bayesian Updating (BatchUp)

Update the query model incrementally with each new query:

  $p(w \mid \theta_i) = \frac{c(w, Q_i) + \mu\, p(w \mid \theta_{i-1})}{|Q_i| + \mu}$

Then incorporate all of the clickthrough data in one batch:

  $p(w \mid \theta'_k) = \frac{\sum_{j=1}^{k-1} c(w, C_j) + \nu\, p(w \mid \theta_k)}{\sum_{j=1}^{k-1} |C_j| + \nu}$

Intuition: all clickthrough data are equally useful.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 27
TREC Style Evaluation
• Data collection: TREC AP88-90
• Topics: 30 hard topics of TREC topics 1-150
• System: search engine + RDBMS
• Context: Query and clickthrough history of 3 participants (http://sifaka.cs.uiuc.edu/ir/ucair/QCHistory.zip)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 28
Example of a Hard Topic
<topic>
<number> 2 (283 relevant docs in 242,918 documents)
<title> Acquisitions
<desc> Document discusses a currently proposed acquisition involving a U.S. company and a foreign company.
<narr> To be relevant, a document must discuss a currently proposed acquisition (which may or may not be identified by type, e.g., merger, buyout, leveraged buyout, hostile takeover, friendly acquisition). The suitor and target must be identified by name; the nationality of one of the companies must be identified as U.S. and the nationality of the other company must be identified as NOT U.S.
</topic>
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 29
Performance of the Hard Topic
Q1: acquisition u.s. foreign company (MAP: 0.004; Pr@20: 0.000)
Q2: acquisition merge takeover u.s. foreign company (MAP: 0.026; Pr@20: 0.100)
Q3: acquire merge foreign abroad international (MAP: 0.004; Pr@20: 0.050)
Q4: acquire merge takeover foreign european japan (MAP: 0.027; Pr@20: 0.200)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 30
Overall Effect of Search Context

Query       | FixInt (α=0.1, β=1.0) | BayesInt (μ=0.2, ν=5.0) | OnlineUp (μ=5.0, ν=15.0) | BatchUp (μ=2.0, ν=15.0)
            | MAP     pr@20         | MAP     pr@20           | MAP     pr@20            | MAP     pr@20
Q3          | 0.0421  0.1483        | 0.0421  0.1483          | 0.0421  0.1483           | 0.0421  0.1483
Q3+HQ+HC    | 0.0726  0.1967        | 0.0816  0.2067          | 0.0706  0.1783           | 0.0810  0.2067
Improvement | 72.4%   32.6%         | 93.8%   39.4%           | 67.7%   20.2%            | 92.4%   39.4%
Q4          | 0.0536  0.1933        | 0.0536  0.1933          | 0.0536  0.1933           | 0.0536  0.1933
Q4+HQ+HC    | 0.0891  0.2233        | 0.0955  0.2317          | 0.0792  0.2067           | 0.0950  0.2250
Improvement | 66.2%   15.5%         | 78.2%   19.9%           | 47.8%   6.9%             | 77.2%   16.4%

• Short-term context helps the system improve retrieval accuracy
• BayesInt is better than FixInt; BatchUp is better than OnlineUp
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 31
Using Clickthrough Data Only

BayesInt (μ=0.0, ν=5.0), clickthrough history only:
Query   | MAP    | pr@20
Q3      | 0.0421 | 0.1483
Q3+HC   | 0.0766 | 0.2033
Improve | 81.9%  | 37.1%
Q4      | 0.0536 | 0.1930
Q4+HC   | 0.0925 | 0.2283
Improve | 72.6%  | 18.1%

Clickthrough is the major contributor.

Performance on unseen docs:
Query   | MAP    | pr@20
Q3      | 0.0331 | 0.125
Q3+HC   | 0.0661 | 0.178
Improve | 99.7%  | 42.4%
Q4      | 0.0442 | 0.165
Q4+HC   | 0.0739 | 0.188
Improve | 67.2%  | 13.9%

Query   | MAP    | pr@20
Q3      | 0.0421 | 0.1483
Q3+HC   | 0.0521 | 0.1820
Improve | 23.8%  | 23.0%
Q4      | 0.0536 | 0.1930
Q4+HC   | 0.0620 | 0.1850
Improve | 15.7%  | -4.1%

Snippets for non-relevant docs are still useful!
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 32
Sensitivity of BatchUp Parameters

[Figure: MAP of Q2+HQ+HC, Q3+HQ+HC, and Q4+HQ+HC as μ varies from 0 to 10 and as ν varies from 0 to 500 in the BatchUp model]

• BatchUp is stable across different parameter settings
• The best performance is achieved at μ = 2.0, ν = 15.0
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 33
A User Study of Implicit Feedback
• The UCAIR toolbar (a client-side personalized search agent using implicit feedback) was used in this study
• 6 participants used the UCAIR toolbar to do web search
• 32 topics were selected from the TREC Web track and Terabyte track
• Participants explicitly evaluated the relevance of the top 30 search results from Google and from UCAIR
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 34
UCAIR Outperforms Google: Precision at N Docs

Ranking Method | prec@5 | prec@10 | prec@20 | prec@30
Google         | 0.538  | 0.472   | 0.377   | 0.308
UCAIR          | 0.581  | 0.556   | 0.453   | 0.375
Improvement    | 8.0%   | 17.8%   | 20.2%   | 21.8%

More user interactions → better user models → better retrieval accuracy
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 35
UCAIR Outperforms Google: PR Curve
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 36
Scenario 2: Use the Entire History of a User [Tan et al. 06]
• Challenge: the search log is noisy
  – How do we handle the noise?
  – Can we still improve performance?
• Solution: assign weights to the history data (cosine similarity, EM algorithm)
• Conclusions:
  – All the history information is potentially useful
  – Most helpful for recurring queries
  – History weighting is crucial (EM better than cosine)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 37
Algorithm Illustration
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 38
Sample Results: EM vs. Baseline
History is helpful and weighting is important
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 39
Sample Results: Different Weighting Methods
EM is better than Cosine; hybrid is feasible
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 40
What You Should Know
• All search history information helps
• Clickthrough information is especially useful; it is useful even when the clicked document is non-relevant
• Recurring queries get the most help, but fresh queries can also benefit from history information
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 41
3. Explicit Feedback [Shen et al. 05, Tan et al. 07]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 42
Term Feedback for Information Retrieval with Language Models
Bin Tan, Atulya Velivelli, Hui Fang, ChengXiang Zhai
University of Illinois at Urbana-Champaign
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 43
Problems with Document-Based Feedback
• A relevant document may contain non-relevant parts
• Sometimes none of the top-ranked documents is relevant
• The user only indirectly controls the learned query model
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 44
What About Term Feedback?
• Present a list of terms to the user and ask for judgments
  – More direct contribution to estimating θq
  – Works even when no relevant document is on top
• Challenges:
  – How do we select terms to present to the user?
  – How do we exploit term feedback to improve our estimate of θq?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 45
Improving θq with Term Feedback

Query → Retrieval Engine (over the document collection) → initial results (d1 3.5, d2 2.4, …)
  → Term extraction → terms presented to the user
  → User provides term judgments
  → Term feedback models → improved estimate of θq → re-retrieval
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 46
Feedback Term Selection
• General (old) idea:
  – The original query is used for an initial retrieval run
  – Feedback terms are selected from the top N documents
• New idea:
  – Model subtopics
  – Select terms to represent every subtopic well
  – Benefits:
    • Avoid bias in term feedback
    • Infer relevant subtopics, thus achieving subtopic feedback
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 47
User-Guided Query Model Refinement

[Figure: the document space contains an area already explored by the user and unexplored areas. Feedback terms t11, t12, t21, t22, t31, t32, … drawn from subtopics T1, T2, T3 are judged by the user (+/-); the judgments reveal an inferred topic preference direction, i.e., the most promising new topic areas to move to.]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 48
Collaborative Estimation of θq

1. Rank documents by D(θq || θd) using the original query model p(w | θq); take the top N docs d1, …, dN.
2. Discover K subtopic clusters C1, …, CK with word distributions p(w | θ1), …, p(w | θK).
3. Present L feedback terms t1, …, tL drawn from the clusters; the user judges them.
4. TFB model: estimated directly from the judged terms, e.g., p(t1 | θTFB) = 0.2, …, p(t3 | θTFB) = 0.1, …
5. CFB model: a mixture of the cluster models weighted according to the term judgments (e.g., cluster weights C1: 0.2, C2: 0.1, C3: 0.3, …, CK: 0.1), i.e., p(w | θCFB) = 0.2·p(w | θ1) + 0.1·p(w | θ2) + …
6. TCFB combines TFB and CFB into a refined query model θq', and documents are re-ranked by D(θq' || θd) (a sketch of this combination follows below).
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 49
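A hedged sketch of how the three feedback models could be combined. The slides do not spell out the exact weighting, so the uniform TFB distribution, the cluster weights proportional to checked terms, and the interpolation weight lam are all assumptions.

```python
def tfb_model(checked_terms):
    """TFB: uniform distribution over the terms the user checked (an assumption;
    the slides only show example probabilities)."""
    p = 1.0 / len(checked_terms)
    return {t: p for t in checked_terms}

def cfb_model(cluster_models, cluster_terms, checked_terms):
    """CFB: mixture of the subtopic cluster models, weighting each cluster by the
    fraction of checked terms it contributed."""
    checked = set(checked_terms)
    weights = [len(checked & set(ts)) for ts in cluster_terms]
    total = sum(weights) or 1
    mixture = {}
    for w_k, model in zip(weights, cluster_models):
        for term, p in model.items():
            mixture[term] = mixture.get(term, 0.0) + (w_k / total) * p
    return mixture

def tcfb_model(tfb, cfb, lam=0.5):
    """TCFB: interpolation of the TFB and CFB models (lam is a hypothetical weight)."""
    vocab = set(tfb) | set(cfb)
    return {w: lam * tfb.get(w, 0.0) + (1 - lam) * cfb.get(w, 0.0) for w in vocab}
```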
Discovering Subtopic Clusters with PLSA [Hofmann 99, Zhai et al. 04]

Query = "transportation tunnel disaster"

Each word w in a document d of the top-ranked set is "generated" from a mixture of k themes plus a background model:
• Theme 1 (e.g., traffic 0.3, railway 0.2, …), Theme 2 (e.g., tunnel 0.1, fire 0.05, smoke 0.02, …), …, Theme k (e.g., tunnel 0.2, amtrak 0.1, train 0.05, …)
• Background θB (e.g., is 0.05, the 0.04, a 0.03, …) is chosen with probability λB; theme θj is chosen with probability (1 − λB)·πd,j

The theme models and mixing weights are fit with the maximum likelihood estimator (EM algorithm); a small sketch follows below.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 50
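A compact EM sketch of a PLSA-style mixture with a fixed background model, in the spirit of [Hofmann 99, Zhai et al. 04] but not the authors' implementation; k, λB, and the toy corpus are arbitrary choices.

```python
import random
from collections import Counter

def plsa_with_background(docs, k=2, lambda_b=0.9, iters=30, seed=0):
    """docs: list of token lists. Returns (theme word distributions, per-doc mixing weights)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    total = Counter(w for d in docs for w in d)
    n_total = sum(total.values())
    p_b = {w: total[w] / n_total for w in vocab}          # background p(w|B)
    theta = [{w: rng.random() for w in vocab} for _ in range(k)]
    for t in theta:                                       # normalize random init
        s = sum(t.values())
        for w in t:
            t[w] /= s
    pi = [[1.0 / k] * k for _ in docs]                    # pi[d][j]
    for _ in range(iters):
        new_theta = [Counter() for _ in range(k)]
        for d, cnt in enumerate(counts):
            new_pi = [0.0] * k
            for w, c in cnt.items():
                p_themes = [pi[d][j] * theta[j][w] for j in range(k)]
                s = sum(p_themes)
                # E-step: posterior over themes, and prob. the word is non-background
                post = [p / s if s > 0 else 1.0 / k for p in p_themes]
                p_not_b = ((1 - lambda_b) * s / (lambda_b * p_b[w] + (1 - lambda_b) * s)
                           if s > 0 else 0.0)
                for j in range(k):
                    new_theta[j][w] += c * p_not_b * post[j]
                    new_pi[j] += c * p_not_b * post[j]
            z = sum(new_pi) or 1.0
            pi[d] = [x / z for x in new_pi]               # M-step for pi[d]
        for j in range(k):                                # M-step for theme models
            z = sum(new_theta[j].values()) or 1.0
            theta[j] = {w: new_theta[j].get(w, 0.0) / z for w in vocab}
    return theta, pi

if __name__ == "__main__":
    docs = ["tunnel fire smoke firefight".split(),
            "tunnel traffic railway bridge".split(),
            "tunnel fire smoke blaze".split()]
    themes, weights = plsa_with_background(docs, k=2)
    for t in themes:
        print(sorted(t.items(), key=lambda x: -x[1])[:3])
```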
Selecting Representative Terms
– Original query terms are excluded
– Shared terms are assigned to their most likely clusters

Cluster 1: tunnel (0.0768), transport (0.0364), traffic (0.0206), railwai (0.0186), harbor (0.0146), rail (0.0140), bridg (0.0139), kilomet (0.0136), truck (0.0133), construct (0.0131), …
Cluster 2: tunnel (0.0935), fire (0.0295), truck (0.0236), french (0.0220), smoke (0.0157), car (0.0154), italian (0.0152), firefight (0.0144), blaze (0.0127), blanc (0.0121), …
Cluster 3: tunnel (0.0454), transport (0.0406), toll (0.0166), amtrak (0.0153), train (0.0129), airport (0.0122), turnpik (0.0105), lui (0.0095), jersei (0.0093), pass (0.0087), …
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 51
User Interface for Term Feedback

[Screenshots of the three clarification-form layouts: feedback terms presented in 1 cluster, 3 clusters, and 6 clusters]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 52
Experiment Setup
• TREC 2005 HARD Track
• AQUAINT corpus (3 GB)
• 50 hard query topics
• NIST assessors spent up to 3 minutes per topic providing feedback through a Clarification Form (CF)
• Submitted CFs: 1x48, 3x16, 6x8 (clusters x terms per cluster)
• Baseline: KL-divergence retrieval method with 5 pseudo-feedback docs
• 48 terms generated from the top 60 docs of the baseline run
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 53
Retrieval Accuracy Comparison
• 1C: 1x48; 3C: 3x16; 6C: 6x8
• Baseline < TFB < CFB < TCFB (except for CFB1C)
• CFB1C: user feedback plays no role (with only one cluster, the CFB mixture is fixed)

      | Baseline | TFB                 | CFB                 | TCFB
      |          | 1C     3C     6C    | 1C     3C     6C    | 1C     3C     6C
MAP   | 0.219    | 0.288  0.288  0.278 | 0.254  0.305  0.301 | 0.274  0.309  0.304
PR@30 | 0.393    | 0.467  0.475  0.457 | 0.399  0.480  0.473 | 0.431  0.491  0.473
RR    | 4339     | 4753   4762   4740  | 4600   4907   4872  | 4767   4947   4906
MAP%  | 0%       | 31.5%  31.5%  26.9% | 16.0%  39.3%  37.4% | 25.1%  41.1%  38.8%
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 54
Reduction of # Terms Presented (MAP)

        | TFB                 | CFB           | TCFB
#terms  | 1C     3C     6C    | 3C     6C     | 3C     6C
6       | 0.245  0.240  0.227 | 0.279  0.279  | 0.281  0.274
12      | 0.261  0.261  0.242 | 0.299  0.286  | 0.297  0.281
18      | 0.275  0.274  0.256 | 0.301  0.282  | 0.300  0.286
24      | 0.276  0.281  0.265 | 0.303  0.292  | 0.305  0.292
30      | 0.280  0.285  0.270 | 0.304  0.296  | 0.307  0.296
36      | 0.282  0.288  0.272 | 0.307  0.297  | 0.309  0.297
42      | 0.283  0.288  0.275 | 0.306  0.298  | 0.309  0.300
48      | 0.288  0.288  0.278 | 0.305  0.301  | 0.309  0.303

(#terms = 12 means 1x12 / 3x4 / 6x2, and so on.)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 55
Clarification Form Completion Time
More than half completed in just 1 min
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 56
Term Relevance Judgment Quality

CF Type                 | 1x48  | 3x16  | 6x8
#checked terms          | 14.8  | 13.3  | 11.2
#relevant terms         | 15.0  | 12.6  | 11.2
#relevant checked terms | 7.9   | 6.9   | 5.9
precision               | 0.534 | 0.519 | 0.527
recall                  | 0.526 | 0.548 | 0.527

(Term relevance is defined following [Zaragoza et al. 04].)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 57
Had the User Checked All "Relevant Terms"… (MAP)
TFB1:  0.288 -> 0.354
TFB3:  0.288 -> 0.354
TFB6:  0.278 -> 0.346
CFB3:  0.305 -> 0.325
CFB6:  0.301 -> 0.326
TCFB3: 0.309 -> 0.345
TCFB6: 0.304 -> 0.341
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 58
Comparison to Relevance Feedback

# FB Docs | MAP   | Pr@30 | RelRet
5         | 0.302 | 0.586 | 4779
10        | 0.345 | 0.670 | 4916
20        | 0.389 | 0.772 | 5004
TCFB3C    | 0.309 | 0.491 | 4947

MAP equivalence: TCFB3C ≈ relevance feedback with 5 docs
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 59
Term Feedback Helps Difficult Topics

[Figure: per-topic comparison; term feedback gives the largest gains on topics with no relevant docs in the top 5]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 60
Related Work
• Early work: [Harman 88], [Spink 94], [Koenemann & Belkin 96], …
• More recent: [Ruthven 03], [Anick 03], …
• Main differences of this work:
  – Language-modeling approach
  – Consistently effective
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 61
Conclusions and Future Work
• A novel way of improving query model estimation through term feedback
  – Active feedback based on subtopics
  – User-system collaboration
  – Achieves a large performance improvement over the non-feedback baseline with a small amount of user effort
  – Can compete with relevance feedback, especially in situations where the latter is unable to help
• To explore: more complex interaction processes
  – Combination of term feedback and relevance feedback
  – Incremental feedback
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 62
What You Should Know
• Term feedback can be quite useful when the query is difficult and relevance feedback isn't feasible
• Language models can handle term weighting well in term feedback
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 63
Active Feedback in Ad Hoc IR
Xuehua Shen, ChengXiang ZhaiDepartment of Computer ScienceUniversity of Illinois, Urbana-Champaign
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 64
Normal Relevance Feedback (RF)

Query → Retrieval System (over the document collection) → top-K results (d1 3.5, d2 2.4, …, dk 0.5)
  → User judges them (d1 +, d2 -, …, dk -) → judgments are fed back to improve the query model
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 65
Document Selection in RF

Query → Retrieval System (over the document collection) → which K docs to present?
  → User judges the presented docs (d1 +, d2 -, …, dk -) → feedback

Can we do better than just presenting the top K? (Consider diversity…)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 66
Active Feedback (AF)
An IR system actively selects documentsfor obtaining relevance judgments
If a user is willing to judge K documents,
which K documents should we present
in order to maximize learning effectiveness?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 67
Outline
• Framework and specific methods
• Experiment design and results
• Summary and future work
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 68
A Framework for Active Feedback
• Consider active feedback as a decision problem
  – Decide on K documents (D) for relevance judgment
• Formalize it as an optimization problem
  – Optimize the expected learning benefit (loss) of requesting relevance judgments on D from the user
• Consider two cases of the loss function, according to the interaction between documents
  – Independent loss: the value of each judged document for learning is independent of the others
  – Dependent loss
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 69
Independent Loss

Rank documents by the expected loss of each individual document and select the top K.
• Constant loss for any relevant and non-relevant docs, or a smaller loss for relevant docs → Top K
• A document is more useful for learning if the prediction of its relevance is more uncertain → Uncertainty Sampling
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 70
Dependent Loss

Heuristics: consider relevance first, then diversity (a code sketch of the first two strategies follows below).
• Gapped Top-K: pick one document out of every G+1 documents in the initial ranking
• K-Cluster Centroid: first select the top N docs of the baseline retrieval, cluster them into K clusters, and present the K cluster centroids
• MMR: model diversity and relevance jointly
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 71
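A small sketch of the two simpler dependent-loss selection strategies referenced above. Gapped Top-K follows directly from the description; for K-Cluster Centroid the slides do not fix a clustering algorithm, so a tiny k-means over hypothetical dense document vectors stands in here.

```python
import math
import random

def gapped_top_k(ranked_docs, k, gap=2):
    """Gapped Top-K: take every (gap+1)-th document from the initial ranking."""
    return ranked_docs[::gap + 1][:k]

def k_cluster_centroid(ranked_docs, vectors, k, n=100, iters=20, seed=0):
    """K-Cluster Centroid sketch: cluster the top-N docs (dense vectors) with a
    tiny k-means and present the doc closest to each centroid."""
    rng = random.Random(seed)
    top = ranked_docs[:n]
    X = [vectors[d] for d in top]
    centers = [list(X[i]) for i in rng.sample(range(len(X)), k)]
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, x in enumerate(X):
            groups[min(range(k), key=lambda c: dist(x, centers[c]))].append(i)
        for c, g in enumerate(groups):
            if g:
                centers[c] = [sum(X[i][dim] for i in g) / len(g) for dim in range(len(X[0]))]
    picks = []
    for c, g in enumerate(groups):
        if g:
            picks.append(top[min(g, key=lambda i: dist(X[i], centers[c]))])
    return picks

if __name__ == "__main__":
    ranked = [f"d{i}" for i in range(1, 17)]
    print(gapped_top_k(ranked, k=5, gap=2))   # d1, d4, d7, d10, d13
```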
Illustration of Three AF Methods

[Figure: an initial ranking of documents 1-16. Top-K (normal feedback) takes documents 1 through K; Gapped Top-K skips G documents between picks; K-Cluster Centroid picks one representative per cluster, aiming at high diversity]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 72
Evaluating Active Feedback

Query → select K docs (Top-K, Gapped, or Clustering) → the K docs are judged using the official judgment file (+ + + - -)
  → the judged docs are used for feedback → feedback results
The initial results (no feedback) serve as the reference run.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 73
Retrieval Methods (Lemur toolkit)

• Query Q → query model θQ; document D → document model θD
• Documents are scored by the KL divergence D(θQ || θD)
• Feedback docs F = {d1, …, dn} come from active feedback; only relevant docs are learned from
• Mixture-model feedback: θQ' = (1 - α)·θQ + α·θF
• Default parameter settings unless otherwise stated
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 74
Comparison of Three AF Methods (judged docs included in the results)

Collection | Active FB Method | #AFRel per topic | MAP   | Pr@10doc
HARD       | Baseline         | /                | 0.301 | 0.501
HARD       | Pseudo FB        | /                | 0.320 | 0.515
HARD       | Top-K            | 3.0              | 0.325 | 0.527
HARD       | Gapped           | 2.6              | 0.330 | 0.548
HARD       | Clustering       | 2.4              | 0.332 | 0.565
AP88-89    | Baseline         | /                | 0.201 | 0.326
AP88-89    | Pseudo FB        | /                | 0.218 | 0.343
AP88-89    | Top-K            | 2.2              | 0.228 | 0.351
AP88-89    | Gapped           | 1.5              | 0.234 | 0.389
AP88-89    | Clustering       | 1.3              | 0.237 | 0.393

Top-K is the worst! Clustering uses the fewest relevant docs.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 75
Appropriate Evaluation of Active Feedback

• Original DB with judged docs (AP88-89, HARD): can't tell whether the ranking of un-judged documents is improved
• Original DB without judged docs: different methods end up with different test documents
• New DB (AP88-89 for feedback, AP90 for testing): shows the learning effect more explicitly, but the new docs must be similar to the original docs
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 76
Retrieval Performance on the AP90 Dataset

Method | Baseline | Pseudo FB | Top K | Gapped Top K | K Cluster Centroid
MAP    | 0.203    | 0.220     | 0.220 | 0.222        | 0.223
pr@10  | 0.295    | 0.317     | 0.321 | 0.326        | 0.325

Top-K is consistently the worst!
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 77
Feedback Model Parameter Factor

[Figure: pr@10 as the mixture-model feedback parameter α varies from 0.5 to 0.98, for Top K, Gapped Top K, and K Cluster Centroid on HARD and on AP88-89]

The α parameter in θQ' = (1 - α)·θQ + α·θF can amplify the effect of feedback.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 78
Summary
• Introduced the active feedback problem
• Proposed a preliminary framework and three methods (Top-K, Gapped Top-K, Clustering)
• Studied the evaluation strategy
• Experiment results show that
  – Presenting the top K is not the best strategy
  – Clustering can generate fewer, higher-quality feedback examples
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 79
Future Work
• Explore other methods for active feedback
• Develop a general framework
• Combine pseudo feedback and active feedback
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 80
What You Should Know
• What active feedback is
• Top-K isn't a good strategy for active feedback; diversifying the presented results is beneficial
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 81
Learn from Web Search Logs to Organize Search Results
Xuanhui Wang and ChengXiang ZhaiDepartment of Computer Science
University of Illinois, Urbana-Champaign
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 82
Motivation
• Search engine utility = Ranking accuracy + Result presentation + …
• Lots of research on improving ranking accuracy
• Relatively little work on improving result presentation
What’s the best way to present search results?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 83
Ranked List Presentation
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 84
However, when the query is ambiguous…

Query = Jaguar
[Figure: a ranked list in which four results are about the car, one about the software, and one about the animal]

Unlikely to be optimal for any particular user!
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 85
Cluster Presentation (e.g., [Hearst et al. 96, Zamir & Etzioni 99])
From http://vivisimo.com
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 86
Deficiencies of Data-Driven Clustering
• Different users may prefer different ways to group the results. E.g., query = "area codes":
  – "phone codes" vs. "zip codes"
  – "international codes" vs. "local codes"
• Cluster labels may not be informative enough to help a user choose the right cluster. E.g., label = "panthera onca"

Need to group search results from a user's perspective.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 87
Our Idea: User-Oriented Clustering
• User-oriented clustering:
  – Partition search results according to the aspects that interest users
  – Label each aspect with words meaningful to users
• Exploit search logs to do both:
  – Partitioning
    • Learn the "interesting aspects" of an arbitrary query
    • Classify results into these aspects
  – Labeling
    • Learn "representative queries" for the identified aspects
    • Use the representative queries to label the aspects
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 88
Rest of the Talk
• General Approach
• Technical Details
• Experiment Results
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 89
Illustration of the General Idea

Query = "car"
1. Retrieval (over the log): find related past queries, e.g., car rental, car pricing, used car, hertz car rental, car accidents, car audio, car crash, …
2. Clustering: group them into aspects, e.g.,
   1. {car rental, hertz car rental, …}
   2. {car pricing, used car, …}
   3. {car accidents, car crash, …}
   4. {car audio, car stereo, …}
   5. …
3. Categorization: assign the search results (www.avis.com, www.hertz.com, www.cars.com, …) to the aspects and label them, e.g.,
   Car rental: www.avis.com, www.hertz.com, …
   Used cars: www.cars.com, …
   Car accidents: …
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 90
User-Oriented Clustering via Log Mining

Query → retrieval against the search history collection (query pseudo-docs 1, 2, …) → similar queries
      → clustering → query aspects 1, …, k → labeling (label 1, label 2, …)
Search results → categorization into the query aspects → organized results
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 91
Implementation Strategy

• History collection: one pseudo-doc per query (query + clicked snippets), pooling identical queries
• Retrieval of similar queries: BM25 (Lemur)
• Clustering into query aspects: star clustering
• Labeling: the center query of each cluster
• Categorization of search results: centroid-based classification
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 92
More Details: Search Engine Log
• Records user activities (queries, clicks), organized into sessions
• Reflects user information needs
• A valuable resource for learning to improve search engine utility
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 93
More Details: Building the History Collection

For every query (e.g., "car rental"):
• Pool all sessions that contain the query, together with the clicked URLs U1, U2, …
• Recover the snippets of the clicked results
• Concatenate them into a query pseudo-doc, e.g.,
  "car rental" → "Car rental, rental cars …", "National car rental …", …
  "jaguar car" → "jaguar, car, parts …", …
The pseudo-docs form the history collection.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 94
More Details: Star Clustering [Aslam et al. 04]

1. Form a similarity graph over the query pseudo-docs
   – TF-IDF weight vectors
   – Cosine similarity
   – Thresholding (keep an edge only if the similarity is high enough)
2. Iteratively identify a "star center" (an uncovered vertex of highest degree) and its "satellites" (its uncovered neighbors)

The "star center" query serves as the label for its cluster (a small code sketch follows below).
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 95
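A sketch of star clustering over query pseudo-docs, following the two steps above; the σ threshold and the toy vectors are made-up values.

```python
import math

def star_clustering(vectors, sigma=0.3):
    """vectors: dict name -> dict term -> TF-IDF weight. Returns a list of
    (star_center, satellites) pairs; the center query labels the cluster."""
    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    names = list(vectors)
    # 1. Similarity graph: keep an edge if cosine similarity >= sigma.
    adj = {a: {b for b in names if a != b and cosine(vectors[a], vectors[b]) >= sigma}
           for a in names}
    clusters, uncovered = [], set(names)
    # 2. Repeatedly pick the uncovered vertex of highest degree as a star center.
    while uncovered:
        center = max(uncovered, key=lambda a: len(adj[a] & uncovered))
        satellites = adj[center] & uncovered
        clusters.append((center, sorted(satellites)))
        uncovered -= satellites | {center}
    return clusters

if __name__ == "__main__":
    print(star_clustering({
        "car rental": {"car": 1.0, "rental": 1.0},
        "hertz car rental": {"hertz": 1.0, "car": 0.8, "rental": 0.8},
        "used car": {"used": 1.0, "car": 1.0},
    }))
```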
Centroid-Based Classifier
• Represent each query pseudo-doc as a term vector (TF-IDF weighting)
• Compute a centroid vector for each cluster/aspect
• Assign a new result vector to the aspect whose centroid is closest to it (a small code sketch follows below)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 96
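A minimal sketch of the centroid-based assignment; sparse TF-IDF vectors are represented as plain dicts, and the aspect names and weights are toy values.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Mean of a list of sparse term vectors."""
    c = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            c[t] += w / len(vectors)
    return dict(c)

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_result(result_vector, aspect_vectors):
    """Assign a search result to the aspect whose centroid is most similar."""
    centroids = {a: centroid(vs) for a, vs in aspect_vectors.items()}
    return max(centroids, key=lambda a: cosine(result_vector, centroids[a]))

if __name__ == "__main__":
    aspects = {"car rental": [{"rental": 1.0, "car": 0.5}, {"hertz": 1.0, "rental": 0.8}],
               "used cars":  [{"used": 1.0, "car": 0.7}, {"pricing": 1.0, "car": 0.6}]}
    print(assign_result({"hertz": 1.0, "car": 0.4}, aspects))   # -> "car rental"
```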
Evaluation: Data Preparation
• Log data: May 2006 search log released by Microsoft Live Labs
• First 2/3 used to simulate history; last 1/3 used to simulate future queries
• History collection: 169,057 queries; 3.5 clicked URLs per query on average
• The "future" collection is further split into two sets, for validation and testing
• Test case: a session with more than 4 clicks and at least 100 matching queries in the history (172 and 177 test cases in the two test sets)
• Clicked URLs are used to approximate relevant documents [Joachims, 2002]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 97
Experiment Design
• Baseline method: the original search engine ranking
• Cluster-based method: traditional clustering based solely on result content
• Log-based method: our method based on search logs
• Evaluation (a small sketch of the measures follows below):
  – Based on the user's perceived ranking accuracy
  – A user is assumed to first view the cluster with the largest number of relevant docs
  – Measures:
    • Precision at 5 documents (P@5)
    • Mean Reciprocal Rank (MRR) of the first relevant document
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 98
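A small sketch of the perceived-accuracy evaluation described above, under the stated assumption that the user first views the cluster containing the most relevant documents; P@5 and the reciprocal rank are computed within that cluster, and MRR is the mean of the reciprocal ranks over test cases. The function name and example data are hypothetical.

```python
def perceived_metrics(clusters, relevant, n=5):
    """clusters: list of ranked doc-id lists; relevant: set of relevant doc ids.
    Returns (P@n, reciprocal rank) for the cluster with the most relevant docs."""
    best = max(clusters, key=lambda c: sum(1 for d in c if d in relevant))
    p_at_n = sum(1 for d in best[:n] if d in relevant) / n
    rr = next((1.0 / (i + 1) for i, d in enumerate(best) if d in relevant), 0.0)
    return p_at_n, rr

if __name__ == "__main__":
    # Two clusters for one test case; d2 and d5 are the relevant (clicked) docs.
    print(perceived_metrics([["d1", "d2", "d5"], ["d3", "d4"]], {"d2", "d5"}))
```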
Overall Comparison
• Log-based >> baseline
• Log-based >> cluster-based
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 99
Diversity Analysis
• Do queries with diverse results benefit more?
• Bin the test cases by the size ratio of the two largest (primary/secondary) clusters; a smaller ratio means more diverse results

Queries with diverse results benefit more.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 100
Query Difficulty Analysis
• Do difficult queries benefit more?
• Bin the test cases by Mean Average Precision (MAP); lower MAP means a more difficult query

Difficult queries benefit more.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 101
Effectiveness of Learning

[Figure: P@5 of the log-based method improves as more history information becomes available]
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 102
Sample Results: Partitioning
• The log-based method and regular clustering partition the results differently
• Query: "area codes": one method groups results into "phone codes" vs. "zip codes", the other into "international codes" vs. "local codes"
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 103
Sample Results: LabelingQuery: apple
Query: jaguar
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 104
Related Work
• Categorization-based (e.g., [Chen & Dumais 00])
  – Labels are meaningful to users
  – The partitioning may not match a user's perspective
• Faceted search and browsing (e.g., [Yee et al. 03])
  – Labels are meaningful to users
  – The partitioning is generally useful for a user
  – Requires faceted metadata
• Rather than pre-specifying fixed categories/metadata, we learn them dynamically from the search log
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 105
Conclusions and Future Work
• Proposed a general strategy for organizing search results based on interesting topic aspects learned from search logs
• Experimented with one way to implement the strategy
• Results show that
  – User-oriented clustering is better than data-oriented clustering
  – It particularly helps difficult topics and topics with diverse results
• Future directions
  – Mixture of data-driven and user-driven clustering
  – Study user interaction/feedback with the cluster interface
  – Use the general search log to "smooth" a personal search log
  – Query-sensitive result presentation
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 106
What You Should Know
• The search history of many users can be combined to benefit a particular user's search
• The difference between user-oriented and data-oriented result organization, and their respective advantages and disadvantages
• How to evaluate clustering results indirectly based on perceived precision
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 107
Future Research Directions in Personalized Search
• Robust personalization:
  – An optimization framework for progressive personalization (gradually become more aggressive in using context/history information)
• More in-depth analysis of implicit feedback information
  – Why does a user add a query term and then drop it after viewing a particular document?
• More computer-user dialogue to help bridge the vocabulary gap
• In general, aim at improving performance for difficult topics
• What's the right architecture for supporting personalized search?
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 108
Roadmap
• This lecture: Personalized search (understanding users)
• Next lecture: NLP for IR (understanding documents)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 109
User-Centered Search Engine

[Figure: a personalized search agent sits on the user's machine between the user and multiple sources (web search engines, desktop files). For a query such as "java", the agent can draw on the user's query history and previously viewed web pages.]

A search agent can know a particular user very well.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 110
User-Centered Adaptive IR (UCAIR)
• A novel retrieval strategy emphasizing
  – user modeling ("user-centered")
  – search context modeling ("adaptive")
  – interactive retrieval
• Implemented as a personalized search agent that
  – sits on the client side (owned by the user)
  – integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
  – collaborates with other agents
  – goes beyond search toward task support
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 112
Challenges in UCAIR
• What's an appropriate retrieval framework for UCAIR?
• How do we optimize retrieval performance in interactive retrieval?
• How do we develop robust and accurate retrieval models that exploit user information and search context?
• How do we evaluate UCAIR methods?
• ……
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 113
The Rest of the Talk
• Part I: A risk minimization framework for UCAIR
• Part II: Improving document ranking with implicit feedback
• Part III: User-specific summarization of search results

Joint work with Xuehua Shen, Bin Tan, and Qiaozhu Mei