
Unsupervised Query Segmentation Using Clickthrough for Information Retrieval

Yanen Li1, Bo-June (Paul) Hsu2, ChengXiang Zhai1 and Kuansan Wang2

1 Department of Computer Science, University of Illinois at Urbana-Champaign
2 Microsoft Research, One Microsoft Way, Redmond, WA

Email: [email protected]

07/25/2011, SIGIR 2011, Beijing China


2

Outline

• Motivation and Related Work
• Unsupervised Query Segmentation Model with Clickthrough
• Query Segmentation Evaluation
• Integrated Language Model with Query Segmentation (QSLM)
• Evaluation of QSLM
• Conclusion and Future Work

3

Motivation

Query segmentation: breaking a query into semantically meaningful segments

bank of america online banking -> [bank of america] [online banking]

Query segmentation is useful for: (1) noun phrase discovery; (2) query reformulation; (3) phrase-based retrieval models; (4) user intent analysis

This Work:

• Task 1: probabilistic query segmentation
bank of america online banking -> {[bank of america] [online banking], 0.502}, {[bank of america online banking], 0.428}, {[bank of] [america] [online banking], 0.001}

• Task 2: retrieval model with query segmentation
Q -> {A(Q)} -> D

4

Related Work on Query Segmentation

• Mutual information based models [Risvik WWW 03, Jones WWW 06]
• Supervised query segmentation models
– MRF [Yu KEYS 09]
– Limitation: need labeled training examples
• Simple N-gram probability models [Hagen SIGIR 10]
• Unsupervised models
– [Tan WWW 2008]
– Minimum description length

Limitation: no relevance information (example: “of the”; query: president of the united states -> president | of the | united states?)

We model query segmentation with clickthrough data, which was previously unexplored.

5

Unsupervised Query Segmentation Model using Clickthrough

Intuitions

• Segments appear both in the query and in the doc
• Relevance information
• How to model?

6

Our Segmentation Model

• A generative model
• Generating a query:

1. Pick a query length n under a length distribution, e.g. n = 4

2. Select a segmentation partition B ∈ B_n according to a segmentation partition model P(B | n, ψ), e.g. [X X] [X X]

3. Generate query segments S_m consistent with B according to a segment unigram model P(S_m | θ), e.g. [food network] [coupon codes]
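Step 2 of the generative story chooses among all segmentation partitions of an n-word query. As a minimal sketch (the function name `segmentations` is an assumption for illustration, not something from the slides), the partitions can be enumerated by deciding, for each of the n-1 gaps between adjacent words, whether a segment boundary falls there:

```python
from itertools import product

def segmentations(words):
    """Yield every way to split `words` into contiguous segments."""
    n = len(words)
    if n == 0:
        return
    # Each of the n-1 gaps between adjacent words is a boundary (1) or not (0).
    for gaps in product([0, 1], repeat=n - 1):
        segs, start = [], 0
        for i, cut in enumerate(gaps, start=1):
            if cut:
                segs.append(" ".join(words[start:i]))
                start = i
        segs.append(" ".join(words[start:]))
        yield segs

query = "food network coupon codes".split()
all_segs = list(segmentations(query))
print(len(all_segs))  # 2 ** (4 - 1) partitions for a 4-word query
```

The slide's example [food network] [coupon codes] is one of the enumerated partitions. The count grows exponentially in n, which is manageable for typical short web queries.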

7

• Under this model:

Prob of seeing Q given B

B: segmentation partition
θ: segment unigram distribution. Vocabulary space: 1, 2, …, K

e.g. P([the cuban swimmer paper] | θ) vs. P(the | θ) P(cuban | θ) P(swimmer | θ) P(paper | θ)

Infinitely strong prior that penalizes longer segments

8

• Extending to <query, doc> pairs

An interpolated model with a global component and a document-specific component

Query: [President] [of the] [united states]

Clicked docs:

1. the White House and President Barack Obama, the 44th President of the United States

2. the united states President Barack Obama …

3. President Obama remained unable to break a stalemate over the debt… Few investors believe the United States …

Prob is not high for this segmentation
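A rough sketch of how an interpolated segment probability could behave on this example. The mixing weight `lam`, the n-gram relative-frequency estimate of the document component, and the values in `theta_global` are all assumptions for illustration; the slide does not specify them.

```python
from collections import Counter

def doc_segment_model(doc_text, max_len=3):
    """Relative frequency of every n-gram (n <= max_len) in one clicked doc."""
    words = doc_text.lower().split()
    counts = Counter(
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    )
    total = sum(counts.values())
    return {seg: c / total for seg, c in counts.items()}

def interpolated_prob(segment, theta_global, theta_doc, lam=0.5):
    """lam * global component + (1 - lam) * document-specific component."""
    return lam * theta_global.get(segment, 0.0) + (1 - lam) * theta_doc.get(segment, 0.0)

# Hypothetical global probabilities; "of the" is frequent in a corpus.
theta_global = {"of the": 0.01, "united states": 0.005}
theta_doc = doc_segment_model("the united states President Barack Obama")

p_of_the = interpolated_prob("of the", theta_global, theta_doc)
p_united_states = interpolated_prob("united states", theta_global, theta_doc)
print(p_of_the, p_united_states)
```

In this toy setup "of the" never occurs in the clicked document, so its document-specific component is zero, while "united states" is supported by both components.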

9

• Parameter estimation

θ: segment unigram distribution. Estimate by maximizing the likelihood over all query-doc pairs

An EM algorithm, e.g. oxford real estate advisors

E step: given θ(k-1), for each Q compute the posterior probability of each valid segmentation given Q

e.g. P([X] [X X] [X] | oxford real estate advisors, θD, ψ)

M step: update θ(k):

P(real estate | θ(k)) ∝ P([X] [X X] [X] | oxford real estate advisors, θD, ψ) + P([X X] [X] | real estate california, θD, ψ) + P([X] [X] [X X] [X] | find a real estate agent, θD, ψ) + …
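The E and M steps above can be sketched as a toy EM loop over queries only. This is a simplified illustration under stated assumptions, not the model from the talk: it omits the document-specific component θD, treats the partition model ψ as uniform, and uses a hard cap `max_len` on segment length as a crude stand-in for the prior that penalizes longer segments. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def segmentations(words, max_len):
    """Yield segmentations whose segments have at most `max_len` words."""
    n = len(words)
    for gaps in product([0, 1], repeat=n - 1):
        segs, start = [], 0
        for i, cut in enumerate(gaps, start=1):
            if cut:
                segs.append(" ".join(words[start:i]))
                start = i
        segs.append(" ".join(words[start:]))
        if all(len(s.split()) <= max_len for s in segs):
            yield segs

def em_segment_unigram(queries, max_len=2, iterations=5):
    candidates = [list(segmentations(q.split(), max_len)) for q in queries]
    vocab = {s for cands in candidates for segs in cands for s in segs}
    theta = {s: 1.0 / len(vocab) for s in vocab}  # uniform initialization
    for _ in range(iterations):
        expected = defaultdict(float)
        for cands in candidates:
            # E step: posterior of each valid segmentation under current theta.
            weights = [math.prod(theta[s] for s in segs) for segs in cands]
            z = sum(weights)
            for segs, w in zip(cands, weights):
                for s in segs:
                    expected[s] += w / z
        # M step: theta is proportional to the expected segment counts.
        total = sum(expected.values())
        theta = {s: expected[s] / total for s in vocab}
    return theta

queries = [
    "oxford real estate advisors",
    "real estate california",
    "find a real estate agent",
]
theta = em_segment_unigram(queries)
print(sorted(theta.items(), key=lambda kv: -kv[1])[:5])
```

Because this sketch leaves out clickthrough evidence, the segments it ranks highest should not be read as the results reported in the talk. It only shows the mechanics of accumulating segmentation posteriors into segment probabilities.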

10

Query Segmentation Evaluation

• Datasets
– Training set from Bing query log
– Test set 1: 500 queries from [Bergsma EMNLP-CoNLL 2007], 3 annotators
– Test set 2: 1000 queries from Bing query log, 3 annotators

• Metrics
– query accuracy
– classify accuracy
– segment precision
– segment recall
– segment F
– Reported on annotation sets A, B, C, Intersection, and Conjunction
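The slides list the metrics without defining them. The sketch below uses definitions that are common in the query segmentation literature and should be read as assumptions: query accuracy is the fraction of queries segmented exactly right, classify accuracy is the fraction of between-word boundary decisions that match the annotation, and segment precision, recall, and F compare predicted and annotated segments by word span. The function names are hypothetical.

```python
def spans(segments):
    """Convert a list of segment strings into a set of (start, end) word spans."""
    out, start = set(), 0
    for seg in segments:
        end = start + len(seg.split())
        out.add((start, end))
        start = end
    return out

def evaluate(predicted, gold):
    exact = gaps_ok = gaps_total = seg_ok = seg_pred = seg_gold = 0
    for pred, ref in zip(predicted, gold):
        p, g = spans(pred), spans(ref)
        n_words = max(end for _, end in g)
        exact += p == g
        p_bounds = {end for _, end in p}
        g_bounds = {end for _, end in g}
        # Gap k lies between word k-1 and word k; compare boundary decisions.
        for k in range(1, n_words):
            gaps_ok += (k in p_bounds) == (k in g_bounds)
        gaps_total += n_words - 1
        seg_ok += len(p & g)
        seg_pred += len(p)
        seg_gold += len(g)
    precision = seg_ok / seg_pred
    recall = seg_ok / seg_gold
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "query_accuracy": exact / len(gold),
        "classify_accuracy": gaps_ok / gaps_total,
        "segment_precision": precision,
        "segment_recall": recall,
        "segment_f": f,
    }

gold = [["bank of america", "online banking"], ["history", "of", "armenia"]]
pred = [["bank of america", "online banking"], ["history of", "armenia"]]
scores = evaluate(pred, gold)
print(scores)
```

In this two-query example the first query is segmented exactly and the second misses one boundary, so query accuracy is 1/2, classify accuracy is 5/6, segment precision is 3/4, and segment recall is 3/5.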

11

Result Snapshot

Test Set 1

30 [elizabeth nj] [factory outlets]
31 [rush university] [medical center]
32 [pitch card game] [program]
33 [hillsborough] [river] [state park]
34 [trane] [vs] [american standard] [a c]
35 [jefferson county al] [school system]
36 [oxford] [real estate] [advisors]
37 [johnson county] [community college]
38 [new york] [insight meditation]
39 [aurora ohio] [movie theater]
40 [trigun] [maximum] [graphic novels]
41 [animals] [redwood] [national park]
42 [prime time] [male] [exotic] [dances]
43 [pacific grove] [adult] [school]
44 [ralph] [m] [brown] [act]
45 [chicago] [gay pride parade]
46 [livermore] [mobile home parks]
47 [vintage] [harley davidson] [soft] [tail] [standard]
48 [aerotemp] [heat pump] [pools]
49 [american indian] [salt] [deficiency]
50 [cheap] [crossword puzzle] [books]

Test Set 2

2030822 [beauty and the beast]
2025251 [history] [of] [armenia]
2030690 [american saddlery country flex saddle]
2024252 [funny] [award] [certificates]
2023090 [champion] [mobile homes]
2027667 [pictures] [of] [best friend] [woman] [hugging]
2022846 [budget driving school] [san diego]
2027746 [publishing] [web site] [internet]
2030341 [you tube] [american idol] [results] [april 2 2008]
… …

12

Subset         Metric              Baseline   Tan's Models   Our Models
                                   MI         EM + corpus    EM + Clicked Doc

Annotation A   query accuracy      0.274      0.414          0.440
               classify accuracy   0.693      0.762          0.776
               segment precision   0.469      0.562          0.598
               segment recall      0.534      0.555          0.639
               segment F           0.499      0.558          0.618

Annotation B   query accuracy      0.244      0.44           0.410
               …
               segment F           0.438      0.573          0.571

Annotation C   query accuracy      0.264      0.416          0.402
               …
               segment F           0.483      0.559          0.582

Intersection   query accuracy      0.343      0.528          0.586
               …
               segment F           0.530      0.645          0.713

Evaluation on Test Set 1

• Clearly outperforms the MI baseline.
• Outperforms the [Tan WWW 2008] model according to Annotations A, C and Intersection.
• Our model + MS Web n-gram beats other models that use additional resources.

13

Segmentation Performance with Respect to Penalty Factor

1. The penalty factor f can affect the results substantially

2. At f=2 it achieves good results

14

Integrated Language Model with Query Segmentation (QSLM)

• Traditional IR models
– TF-IDF, BM25, Unigram LM …
– Terms are scored independently

• Proximity heuristics [Tao SIGIR 07]

• Higher order LMs (biterm LM [Srikanth SIGIR 02])

• Capturing linkage [Gao SIGIR 04]

Simple Oracle Ranker

Oracle Ranker Procedure

qID       Unigram   Bigram   Oracle
2024077   0.33      0.25     0.33
2024272   0.3       0.34     0.34
2024291   0.29      0.36     0.36
…

Remarks on the result:

1. Oracle ranker performs very well
2. Simulate similar behavior with query segmentation
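In the three rows shown, the Oracle column equals the larger of the Unigram and Bigram scores for each query. The sketch below reproduces that behavior under the assumption that the oracle simply picks the better of the two models per query. The slides do not name the ranking metric or spell out the procedure, so this is an interpretation of the table rather than a statement of the authors' method.

```python
# Per-query scores copied from the table: qID -> (unigram, bigram).
scores = {
    2024077: (0.33, 0.25),
    2024272: (0.3, 0.34),
    2024291: (0.29, 0.36),
}

def oracle_scores(per_query):
    """For each query, keep the score of whichever model did better."""
    return {qid: max(unigram, bigram) for qid, (unigram, bigram) in per_query.items()}

oracle = oracle_scores(scores)
for qid, value in oracle.items():
    print(qid, value)
```

Such an oracle needs to know in advance which model wins for each query, so it can only be computed with relevance judgments in hand. It serves as a reference point that a segmentation-aware model can try to approximate.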

15

QSLM Model

• Query seg prob
• LM components:

1. doc LM model

2. background LM model

16

How to score docs under QSLM

Query: bank of america online

Doc 1: AOL Inc. (NYSE: AOL, stylized as "Aol.", and previously known as America Online) is an American global Internet services and media company

Doc 2: Online Banking from Bank of America lets you manage your accounts, pay your bills, view credit card activity and more.

Document   Query Segmentation             Prob   a/(a+b)   Ranking score
Doc 1      [bank of america] [online]     0.94   0.6       0.564
           [bank] [of] [america online]   0.02   0.8       0.016
           Total                                           0.58
Doc 2      [bank of america] [online]     0.94   0.9       0.846
           [bank] [of] [america online]   0.02   0.4       0.008
           Total                                           0.854
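The numbers in this example are consistent with a document score that sums, over candidate segmentations, the segmentation probability times the a/(a+b) term. The sketch below checks that arithmetic using the values from the slide. The function name `qslm_score` is hypothetical, and the sketch takes a/(a+b) as given without modeling how it is computed.

```python
def qslm_score(rows):
    """Sum of segmentation probability * a/(a+b) over candidate segmentations."""
    return sum(seg_prob * lm_term for seg_prob, lm_term in rows)

# (segmentation probability, a/(a+b)) pairs from the slide.
doc1 = [(0.94, 0.6), (0.02, 0.8)]  # [bank of america] [online], [bank] [of] [america online]
doc2 = [(0.94, 0.9), (0.02, 0.4)]

score1 = qslm_score(doc1)
score2 = qslm_score(doc2)
print(round(score1, 3), round(score2, 3))
```

Doc 2 ranks above Doc 1 because the highly probable segmentation [bank of america] [online] matches it better, while the AOL document only matches the unlikely [america online] reading.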

17

Evaluation of QSLM on Search Ranking

Dataset from Bing: 12,064 queries

Baselines: BM25, Unigram LM, Bigram LM

Results on Web Search

1. Better performance than BM25 and Unigram, Bigram LMs
2. Results more significant on longer queries

18

How many segmentations are needed?

1. More segmentations, better search ranking
2. A small number of segmentations is enough

19

Conclusions and Future Work

• Unsupervised model using clickthrough is effective on query segmentation

• LM with query segmentation can improve search ranking

• But QSLM still underperforms Oracle Ranker

• Better model to incorporate query segmentation is desirable

20

Acknowledgement

We thank SIGIR for the Travel Grant support!

21

Questions?

Email: [email protected]


22

Thank You!
