Unsupervised Query Segmentation Using Clickthrough for Information Retrieval

Yanen Li 1, Bo-June (Paul) Hsu 2, ChengXiang Zhai 1 and Kuansan Wang 2

1 Department of Computer Science, University of Illinois at Urbana-Champaign
2 Microsoft Research, One Microsoft Way, Redmond, WA

Email: [email protected]

07/25/2011, SIGIR 2011, Beijing, China




Outline

• Motivation and Related Work
• Unsupervised Query Segmentation Model with Clickthrough
• Query Segmentation Evaluation
• Integrated Language Model with Query Segmentation (QSLM)
• Evaluation of QSLM
• Conclusion and Future Work


Motivation

Query segmentation: breaking a query into semantically meaningful segments

bank of america online banking -> [bank of america] [online banking]

Query segmentation is useful for: (1) noun phrase discovery; (2) query reformulation; (3) phrase-based retrieval models; (4) user intent analysis

This Work:
• Task 1: probabilistic query segmentation
  bank of america online banking
  {[bank of america] [online banking], 0.502}, {[bank of america online banking], 0.428}, {[bank of] [america] [online banking], 0.001}
• Task 2: retrieval model with query segmentation
  Q -> {A(Q)} -> D
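To make the Task 1 output space concrete, here is a minimal sketch that enumerates every candidate segmentation of the example query. The function name `segmentations` is illustrative and not from the slides; the probabilities attached to each candidate come from the model and are not computed here.

```python
from itertools import combinations

def segmentations(query):
    """Yield every split of the query into contiguous word segments."""
    words = query.split()
    n = len(words)
    for k in range(n):  # k = number of internal boundaries
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[i:j]) for i, j in zip(bounds, bounds[1:])]

candidates = list(segmentations("bank of america online banking"))
print(len(candidates))  # 16, i.e. 2 ** (5 - 1) boundary patterns
```

A query of n words has 2^(n-1) candidate segmentations, so exhaustive enumeration is only practical for short queries.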


Related Work of Query Segmentation

• Mutual information based models [Risvik WWW 03, Jones WWW 06]
• Supervised query segmentation models
  – MRF [Yu KEYS 09]
  – Limitation: need labeled training examples
• Simple N-gram probability models [Hagen SIGIR 10]
• Unsupervised models
  – [Tan WWW 2008]
  – Minimum description length

Limitation: no relevance information (example: “of the” in the query "president of the united states": president | of the | united states?)

We model query segmentation with clickthrough data, which has not been explored previously


Unsupervised Query Segmentation Model using Clickthrough

Intuitions

• Appear both in query and doc
• Relevance information
• How to model?


Our Segmentation Model

• A generative model
• Generating a query:

1. Pick a query length n under a length distribution; e.g. n = 4

2. Select a segmentation partition B ∈ B_n, according to a segmentation partition model P(B | n, ψ); e.g. [X X] [X X]

3. Generate query segments S_m consistent with B, according to a segment unigram model P(S_m | θ); e.g. [food network] [coupon codes]
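The three generation steps can be sketched as a small sampler. This is a toy illustration under stated assumptions, not the authors' implementation: the names `sample_query`, `length_dist` and `segments_by_length` are hypothetical, a fair coin per word gap stands in for the partition model P(B | n, ψ), and the segment unigram model is approximated by per-length tables with made-up probabilities.

```python
import random

def sample_query(length_dist, segments_by_length, rng):
    # Step 1: draw the query length n.
    n = rng.choices(list(length_dist), weights=list(length_dist.values()))[0]
    # Step 2: draw a partition B; a fair coin per gap stands in for P(B | n, psi).
    lengths, run = [], 1
    for _ in range(n - 1):
        if rng.random() < 0.5:
            lengths.append(run)
            run = 1
        else:
            run += 1
    lengths.append(run)
    # Step 3: draw each segment from the entries of the length B requires.
    query = []
    for length in lengths:
        options = segments_by_length[length]
        query.append(rng.choices(list(options), weights=list(options.values()))[0])
    return query

# Made-up toy distributions, for illustration only.
length_dist = {2: 0.5, 4: 0.5}
segments_by_length = {
    1: {"coupon": 0.5, "codes": 0.5},
    2: {"food network": 0.5, "coupon codes": 0.5},
    3: {"food network coupon": 1.0},
    4: {"food network coupon codes": 1.0},
}
print(sample_query(length_dist, segments_by_length, random.Random(0)))
```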


• Under this model: [equation not preserved in the transcript; its annotations follow]

Probability of seeing Q given B

B: segmentation partition
θ: segment unigram distribution. Vocabulary space: 1, 2, …, K

An infinitely strong prior penalizes longer segments

e.g. P([the cuban swimmer paper] | θ) vs. P(the | θ) P(cuban | θ) P(swimmer | θ) P(paper | θ)
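The comparison in the example can be made concrete with a short sketch. It reads "segment unigram model" as treating the segments of a partition as independent draws, so the probability of Q given B is a product over segments; that reading, the helper name `prob_given_partition`, and all probability values below are assumptions made for illustration.

```python
from math import prod

def prob_given_partition(segments, theta, unseen=1e-9):
    """P(Q | B, theta) as a product of segment unigram probabilities."""
    return prod(theta.get(segment, unseen) for segment in segments)

# Made-up probabilities, for illustration only.
theta = {
    "the cuban swimmer paper": 1e-7,
    "the": 0.05,
    "cuban": 1e-4,
    "swimmer": 1e-4,
    "paper": 1e-3,
}
whole = prob_given_partition(["the cuban swimmer paper"], theta)
split = prob_given_partition(["the", "cuban", "swimmer", "paper"], theta)
print(whole > split)
```

With numbers like these a single long segment easily beats the product of four word probabilities, which is one way to see why a prior that penalizes longer segments is needed.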


• Extending to <query, doc> pairs

An interpolated model with a global component and a document-specific component

Query: [President] [of the] [united states]

Clicked docs:
1. the White House and President Barack Obama, the 44th President of the United States
2. the united states President Barack Obama …
3. President Obama remained unable to break a stalemate over the debt … Few investors believe the United States …

The probability is not high for this segmentation
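A rough sketch of how a document-specific component could lower the score of "of the" in this example. The interpolation formula is not preserved in the transcript, so everything here is an assumption: the linear mix with weight `lam`, the relative-frequency estimate in `doc_prob`, and the function names are all hypothetical.

```python
import re

def ngrams(text, k):
    """All k-word phrases in the text, lowercased, punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def doc_prob(segment, docs):
    """Relative frequency of the segment among same-length phrases in docs."""
    k = len(segment.split())
    grams = [g for d in docs for g in ngrams(d, k)]
    return grams.count(segment.lower()) / len(grams) if grams else 0.0

def interpolated_prob(segment, theta_global, docs, lam=0.5):
    """Mix a global segment probability with a clicked-document estimate."""
    return (1 - lam) * theta_global.get(segment, 0.0) + lam * doc_prob(segment, docs)

clicked_docs = [
    "the White House and President Barack Obama, the 44th President of the United States",
    "the united states President Barack Obama",
    "President Obama remained unable to break a stalemate over the debt "
    "Few investors believe the United States",
]
print(doc_prob("united states", clicked_docs), doc_prob("of the", clicked_docs))
```

In the three clicked snippets "united states" occurs three times and "of the" once, so a segmentation that isolates [of the] gets less support from the clicked documents.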


• Parameter estimation

θ: segment unigram distribution
Estimated by maximizing the likelihood over all query-doc pairs

An EM algorithm, e.g. for the query "oxford real estate advisors":

E step: given θ(k-1), for each Q compute the posterior probability of each valid segmentation given Q
e.g. P([X] [X X] [X] | oxford real estate advisors, θD, ψ)

M step: update θ(k):
P(real estate | θ(k)) ∝ P([X] [X X] [X] | oxford real estate advisors, θD, ψ)
  + P([X X] [X] | real estate california, θD, ψ)
  + P([X] [X] [X X] [X] | find a real estate agent, θD, ψ)
  + …
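The E and M steps can be sketched on queries alone. This is a simplified illustration, not the authors' procedure: it ignores the clicked documents, enumerates segmentations exhaustively, and uses a hypothetical length penalty (each segment of L words is damped by L ** -penalty) in place of the partition model, whose exact form is not given in the transcript.

```python
from collections import defaultdict
from itertools import combinations
from math import prod

def segmentations(words):
    """Yield every split of a word list into contiguous segments."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield tuple(" ".join(words[i:j]) for i, j in zip(bounds, bounds[1:]))

def posterior(words, theta, penalty=2.0):
    """E step: posterior over segmentations of one query under theta."""
    scores = {
        # Hypothetical penalty: a segment of L words is damped by L ** -penalty.
        seg: prod(theta[s] * len(s.split()) ** -penalty for s in seg)
        for seg in segmentations(words)
    }
    total = sum(scores.values())
    return {seg: score / total for seg, score in scores.items()}

def em(queries, iterations=5, penalty=2.0):
    candidates = {s for q in queries for seg in segmentations(q.split()) for s in seg}
    theta = {s: 1.0 / len(candidates) for s in candidates}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for q in queries:
            for seg, p in posterior(q.split(), theta, penalty).items():
                for s in seg:
                    counts[s] += p  # expected count of segment s
        total = sum(counts.values())
        theta = {s: counts[s] / total for s in candidates}  # M step: normalize
    return theta

queries = ["oxford real estate advisors", "real estate california",
           "find a real estate agent"]
theta = em(queries)
print(round(theta["real estate"], 4))
```

The M step here is exactly the sum shown above: the expected count of [real estate] is the total posterior mass of the segmentations that contain it, normalized over all candidate segments.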


Query Segmentation Evaluation

• Datasets
  – Training set from Bing query log
  – Test set 1: 500 queries from [Bergsma EMNLP-CoNLL 2007], 3 annotators
  – Test set 2: 1000 queries from Bing query log, 3 annotators
• Metrics
  – query accuracy
  – classify accuracy
  – segment precision
  – segment recall
  – segment F
  – Reported on set A, set B, set C, set Intersection & Conjunction
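The slides name the metrics without defining them. The sketch below uses the definitions commonly applied to this task, which is an assumption here: query accuracy is the share of queries segmented exactly as in the reference, classify accuracy is the share of between-word boundary decisions that agree, and segment precision and recall compare predicted and reference segments by position. The function names are illustrative.

```python
def spans(segmentation):
    """Map a list of segment strings to a set of (start, end) word spans."""
    out, start = set(), 0
    for segment in segmentation:
        end = start + len(segment.split())
        out.add((start, end))
        start = end
    return out

def evaluate(predicted, reference):
    exact = agree = decisions = hits = n_pred = n_ref = 0
    for pred, ref in zip(predicted, reference):
        p, r = spans(pred), spans(ref)
        n_words = max(end for _, end in r)
        # A break after word i is implied by a span ending at i (query end excluded).
        p_breaks = {end for _, end in p} - {n_words}
        r_breaks = {end for _, end in r} - {n_words}
        exact += p == r
        decisions += n_words - 1
        agree += (n_words - 1) - len(p_breaks ^ r_breaks)
        hits += len(p & r)
        n_pred += len(p)
        n_ref += len(r)
    precision, recall = hits / n_pred, hits / n_ref
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "query accuracy": exact / len(reference),
        "classify accuracy": agree / decisions,
        "segment precision": precision,
        "segment recall": recall,
        "segment F": f,
    }

predicted = [["bank of america", "online banking"], ["history", "of armenia"]]
reference = [["bank of america", "online banking"], ["history", "of", "armenia"]]
scores = evaluate(predicted, reference)
print(scores)
```

Segment F is the harmonic mean of segment precision and recall; applied to 0.598 and 0.639 it gives about 0.618.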


Result Snapshot

Test Set 1

30 [elizabeth nj] [factory outlets]
31 [rush university] [medical center]
32 [pitch card game] [program]
33 [hillsborough] [river] [state park]
34 [trane] [vs] [american standard] [a c]
35 [jefferson county al] [school system]
36 [oxford] [real estate] [advisors]
37 [johnson county] [community college]
38 [new york] [insight meditation]
39 [aurora ohio] [movie theater]
40 [trigun] [maximum] [graphic novels]
41 [animals] [redwood] [national park]
42 [prime time] [male] [exotic] [dances]
43 [pacific grove] [adult] [school]
44 [ralph] [m] [brown] [act]
45 [chicago] [gay pride parade]
46 [livermore] [mobile home parks]
47 [vintage] [harley davidson] [soft] [tail] [standard]
48 [aerotemp] [heat pump] [pools]
49 [american indian] [salt] [deficiency]
50 [cheap] [crossword puzzle] [books]

Test Set 2

2030822 [beauty and the beast]
2025251 [history] [of] [armenia]
2030690 [american saddlery country flex saddle]
2024252 [funny] [award] [certificates]
2023090 [champion] [mobile homes]
2027667 [pictures] [of] [best friend] [woman] [hugging]
2022846 [budget driving school] [san diego]
2027746 [publishing] [web site] [internet]
2030341 [you tube] [american idol] [results] [april 2 2008]
… …


Evaluation on Test Set 1

Subset        Metric             Baseline (MI)  Tan's Model (EM + corpus)  Our Model (EM + Clicked Doc)
Annotation A  query accuracy     0.274          0.414                      0.440
              classify accuracy  0.693          0.762                      0.776
              segment precision  0.469          0.562                      0.598
              segment recall     0.534          0.555                      0.639
              segment F          0.499          0.558                      0.618
Annotation B  query accuracy     0.244          0.440                      0.410
              classify accuracy  0.634          0.774                      0.750
              segment precision  0.408          0.568                      0.521
              segment recall     0.472          0.578                      0.631
              segment F          0.438          0.573                      0.571
Annotation C  query accuracy     0.264          0.416                      0.402
              classify accuracy  0.666          0.759                      0.756
              segment precision  0.451          0.558                      0.548
              segment recall     0.519          0.561                      0.619
              segment F          0.483          0.559                      0.582
Intersection  query accuracy     0.343          0.528                      0.586
              classify accuracy  0.728          0.815                      0.842
              segment precision  0.510          0.640                      0.681
              segment recall     0.550          0.650                      0.747
              segment F          0.530          0.645                      0.713

-- Clearly outperforms the MI baseline.
-- Outperforms the [Tan WWW 2008] model according to A, C and Intersection.
-- Our Model + MS Web n-gram beats other models with additional resources.


Segmentation Performance with Respect to Penalty Factor

1. The penalty factor strongly affects the results

2. Good results are achieved at f = 2


Integrated Language Model with Query Segmentation (QSLM)

• Traditional IR models
  – TF-IDF, BM25, Unigram LM …
  – Terms are scored independently
• Proximity heuristics [Tao SIGIR 07]
• Higher order LMs (biterm LM [Srikanth SIGIR 02])
• Capturing linkage [Gao SIGIR 04]

Simple Oracle Ranker

Oracle Ranker Procedure (diagram not preserved)

Result

qID      Unigram  Bigram  Oracle
2024077  0.33     0.25    0.33
2024272  0.3      0.34    0.34
2024291  0.29     0.36    0.36

Remarks:
1. Oracle ranker performs very well
2. Simulate similar behavior with query seg
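The three result rows are consistent with an oracle that keeps, for each query, whichever of the unigram and bigram rankers scored higher. The procedure diagram is not preserved, so this per-query maximum is an inference from the rows shown rather than a confirmed description.

```python
# Per-query scores copied from the result rows: (unigram, bigram).
results = {
    "2024077": (0.33, 0.25),
    "2024272": (0.3, 0.34),
    "2024291": (0.29, 0.36),
}

# Assumed oracle rule: pick the better of the two rankers for each query.
oracle = {qid: max(unigram, bigram) for qid, (unigram, bigram) in results.items()}
print(oracle)
```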


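QSLM Model

[Equation not preserved in the transcript. Its annotated parts: the query segmentation probability, and an LM term built from (1) a doc LM model and (2) a background LM model.]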


How to score docs under QSLM

Query: bank of america online

Doc 1: AOL Inc. (NYSE: AOL, stylized as "Aol.", and previously known as America Online) is an American global Internet services and media company
Doc 2: Online Banking from Bank of America lets you manage your accounts, pay your bills, view credit card activity and more.

Document  Query Segmentation            Prob  a/(a+b)  Prob × a/(a+b)
Doc 1     [bank of america] [online]    0.94  0.6      0.564
          [bank] [of] [america online]  0.02  0.8      0.016
          Ranking score                                0.58
Doc 2     [bank of america] [online]    0.94  0.9      0.846
          [bank] [of] [america online]  0.02  0.4      0.008
          Ranking score                                0.854
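The ranking scores in this example (0.58 and 0.854) match a sum, over the listed segmentations, of the segmentation probability times the a/(a+b) factor. The sketch below reproduces that arithmetic; the function name `ranking_score` is illustrative, and what a and b stand for is not spelled out in the transcript.

```python
def ranking_score(rows):
    """Sum of segmentation probability times the per-segmentation factor."""
    return sum(prob * factor for prob, factor in rows)

# (segmentation probability, a/(a+b)) pairs copied from the example.
doc1 = [(0.94, 0.6), (0.02, 0.8)]
doc2 = [(0.94, 0.9), (0.02, 0.4)]

score1, score2 = ranking_score(doc1), ranking_score(doc2)
print(round(score1, 3), round(score2, 3))  # 0.58 0.854
```

Doc 2, the Bank of America page, ranks above the AOL page because the dominant segmentation [bank of america] [online] fits it much better.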


Evaluation of QSLM on Search Ranking

Dataset from Bing: 12,064 queries

Baselines: BM25, Unigram LM, Bigram LM

Results on Web Search
1. Better performance than BM25 and the Unigram and Bigram LMs
2. Results are more significant on longer queries


How many segmentations are needed?
1. More segmentations, better search ranking
2. A small number of segmentations is enough


Conclusions and Future Work

• Unsupervised model using clickthrough is effective for query segmentation
• LM with query segmentation can improve search ranking
• But QSLM still underperforms the Oracle Ranker
• A better model to incorporate query segmentation is desirable


Acknowledgement

We thank SIGIR for the Travel Grant support!


Questions?

Email: [email protected]


Thank You!