University of Alberta, June 30, 2007, Slide 1

Learning Noun Phrase Query Segmentation

Shane Bergsma and Qin Iris Wang
University of Alberta

EMNLP 2007

Slide 2

Query Segmentation
• Input: Search engine query
• Output: Query separated into phrases
• Goal: Improve information retrieval
• Approach: Supervised machine learning
  – A classifier makes segmentation decisions
• Conclusion: Richer features allow for large increases in segmentation performance

Slide 3

Outline
1. Introduction
2. Segmentation as Classification
3. Features
4. Data and Experiments
5. Results

Slide 4

Growth of the Web
[Figure: Total Sites Across All Domains, August 1995 - June 2007. Source: Netcraft June 2007 Web Server Survey]

Slide 5

Query Segmentation
• Matching tokens is not sufficient
• Need better strategies for interpreting queries
• Example query:
  – two man power saw
• Interpretation using phrases:
  – “man power”? “power saw”?

Slide 6

Query Segmentation
Unsegmented: two man power saw
“two” “man” “power” “saw”

Slide 7

Query Segmentation
“two man” “power saw”

Slide 8

Query Segmentation
“two” “man” “power saw”

Slide 9

Query Segmentation
• Improves precision
• Can also help with recall:
  – First step in query substitution / expansion:
  – “two man” “power saw” to: “two person” “power saw”
  – Unsegmented:
  – “two” “man” “power” “saw” to: “three” “woman” “authority” “witnessed”

Slide 10

Query Segmentation
• How to segment?
  – Link tokens with high statistical association
• Jones et al. (WWW 2006) use the Mutual Information (MI):
  – MI(x,y) = Pr(x,y) / (Pr(x) Pr(y))
• Link tokens x and y if their MI > threshold
• For an N-token query, there are 2^(N-1) segmentations
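The threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts in `UNIGRAM`, `BIGRAM` and `TOTAL` are invented toy numbers, and the function names are hypothetical, not taken from Jones et al. or from the talk.

```python
from itertools import product

# Hypothetical toy counts, invented purely for illustration.
TOTAL = 1_000_000
UNIGRAM = {"two": 5000, "man": 4000, "power": 3000, "saw": 1000}
BIGRAM = {("two", "man"): 40, ("man", "power"): 6, ("power", "saw"): 90}


def mi(x, y):
    """Ratio form used on the slide: Pr(x,y) / (Pr(x) Pr(y))."""
    p_xy = BIGRAM.get((x, y), 0) / TOTAL
    return p_xy / ((UNIGRAM[x] / TOTAL) * (UNIGRAM[y] / TOTAL))


def segment_by_mi(tokens, threshold):
    """Link adjacent tokens whose MI exceeds the threshold, else start a new segment."""
    segments = [[tokens[0]]]
    for x, y in zip(tokens, tokens[1:]):
        if mi(x, y) > threshold:
            segments[-1].append(y)
        else:
            segments.append([y])
    return [" ".join(s) for s in segments]


def all_segmentations(tokens):
    """Enumerate every segmentation: one break / no-break choice per gap."""
    result = []
    for breaks in product([False, True], repeat=len(tokens) - 1):
        segments = [[tokens[0]]]
        for tok, brk in zip(tokens[1:], breaks):
            if brk:
                segments.append([tok])
            else:
                segments[-1].append(tok)
        result.append([" ".join(s) for s in segments])
    return result


query = "two man power saw".split()
print(segment_by_mi(query, threshold=1.0))
print(len(all_segmentations(query)))  # 2^(4-1) segmentations
```

With these toy counts the MI values are 2.0, 0.5 and 30, so a threshold of 1.0 links “two man” and “power saw” and breaks between “man” and “power”.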

Slide 11

Query Segmentation
• Similar to Noun Compound Bracketing
  – Forms binary tree (bracketing) over tokens
  – [used [car parts]] or [[used car] parts]
  – In principle, more bracketings than segmentations
• Our goal:
  – Apply bracketing statistics used in Nakov & Hearst (CoNLL 2005) to query segmentation
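To make the comparison between bracketings and segmentations concrete, the sketch below counts both. It assumes the standard combinatorial result that the number of binary bracketings over N tokens is the Catalan number C(N-1); the function names are hypothetical.

```python
from math import comb


def num_segmentations(n):
    """One break / no-break choice at each of the n-1 gaps."""
    return 2 ** (n - 1)


def num_bracketings(n):
    """Binary trees over n leaves: Catalan number C_{n-1}."""
    k = n - 1
    return comb(2 * k, k) // (k + 1)


for n in range(2, 9):
    print(n, num_segmentations(n), num_bracketings(n))
```

Under this count, three tokens give 2 bracketings (the two shown for "used car parts") against 4 segmentations, and bracketings only overtake segmentations from six tokens onward (42 versus 32).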

Slide 12

Segmentation as Classification
• Our approach:
  – Turn query segmentation into classification
  – Discriminatively learn a classifier to make segmentation decisions
• Benefits:
  – Allows a large number of possibly overlapping features
  – Adapts to training data / task of interest

Slide 13

Segmentation as Classification
“two man” “power saw”

Labeled token pairs (+ marks a segment break between the two tokens):
- two man
+ man power
- power saw

Feature vectors:
- <1,0,0,1… >
+ <0,0,1,1… >
- <0,1,0,1… >

Support Vector Machine
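The mapping from a segmented query to one labeled example per adjacent token pair might look like the sketch below. The encoding (+1 for a break, -1 for no break) is an assumption chosen to match the signs shown on the slide, and both function names are hypothetical.

```python
def gap_labels(segments):
    """Return (left, right, label) per adjacent pair; +1 = break, -1 = no break."""
    tokens = []
    break_after = set()
    for seg in segments:
        tokens.extend(seg.split())
        break_after.add(len(tokens) - 1)  # index of the last token in this segment
    examples = []
    for i in range(len(tokens) - 1):
        label = 1 if i in break_after else -1
        examples.append((tokens[i], tokens[i + 1], label))
    return examples


def apply_labels(tokens, labels):
    """Inverse direction: rebuild segments from per-gap decisions."""
    segments = [[tokens[0]]]
    for tok, label in zip(tokens[1:], labels):
        if label > 0:
            segments.append([tok])
        else:
            segments[-1].append(tok)
    return [" ".join(s) for s in segments]


examples = gap_labels(["two man", "power saw"])
print(examples)
print(apply_labels("two man power saw".split(), [lab for _, _, lab in examples]))
```

Each tuple would then be turned into a feature vector and passed to the classifier; the second function shows that the per-gap decisions are enough to recover the segmentation.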

Slide 14

Segmentation as Classification

… X Y …

Slide 15

Features
• Basic Features
  MI(x,y) = Pr(x,y) / (Pr(x) Pr(y))
  log MI(x,y) = log Pr(x,y) - log Pr(x) - log Pr(y)
              = log C(x,y) - log C(x) - log C(y) + normalizer
• Can use separately: < log C(x,y), log C(x), log C(y) >, called the Basic features
• Use counts from search engine
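As a small worked check of the decomposition above, the sketch below builds the three Basic features from raw counts and recovers log MI from them with fixed weights (+1, -1, -1). It assumes all probabilities are counts divided by a single corpus size `total`, so the normalizer is log(total), and that all counts are positive; the function names are hypothetical.

```python
import math


def basic_features(c_xy, c_x, c_y):
    """The three Basic features: log counts of the pair and of each token."""
    return [math.log(c_xy), math.log(c_x), math.log(c_y)]


def log_mi(c_xy, c_x, c_y, total):
    """log MI as a fixed-weight combination of the Basic features plus a normalizer."""
    weights = [1.0, -1.0, -1.0]
    feats = basic_features(c_xy, c_x, c_y)
    return sum(w * f for w, f in zip(weights, feats)) + math.log(total)


# Toy counts, invented for illustration: MI = (90 * 1e6) / (3000 * 1000) = 30
print(log_mi(90, 3000, 1000, 1_000_000), math.log(30))
```

Supplying the three log counts separately lets a learned classifier pick its own weights instead of being tied to the (+1, -1, -1) combination that MI implies.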

Slide 16

Indicator Features

Name              Description
Is-the            Token x, y = “the”
Is-free           Token x, y = “free”
POS-tags          Part-of-speech tags of x y
Forward position  Position from beginning of query
Reverse position  Position from end of query
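One possible encoding of the indicator features in the table is sketched below. The feature-name strings, the zero-based position convention, and the idea of passing part-of-speech tags in as a precomputed list are all assumptions made for illustration; the slides do not specify them.

```python
def indicator_features(tokens, i, pos_tags=None):
    """Sparse binary features for the decision between tokens[i] and tokens[i+1]."""
    x, y = tokens[i], tokens[i + 1]
    feats = {}
    for name, word in (("is-the", "the"), ("is-free", "free")):
        if x == word:
            feats[f"{name}:x"] = 1
        if y == word:
            feats[f"{name}:y"] = 1
    if pos_tags is not None:
        # Tags are assumed to come from an external tagger, one per token.
        feats[f"pos:{pos_tags[i]}_{pos_tags[i + 1]}"] = 1
    feats[f"fwd-pos:{i}"] = 1                    # gaps counted from the start
    feats[f"rev-pos:{len(tokens) - 2 - i}"] = 1  # gaps counted from the end
    return feats


tokens = "the free online dictionary".split()
tags = ["DT", "JJ", "JJ", "NN"]  # hypothetical tags
print(indicator_features(tokens, 0, tags))
```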

Slide 17

Statistical Features

Name        Description
Definite    Count “the x y”
Collapsed   Count “xy”
And-count   Count “x and y”
Genitive    Count “x’s y”
Query-DB    Counts of x, x y in AOL database

Slide 18

Example
• star wars weapons guns
  – star wars: high counts of “star wars”, “starwars”, but low “star and wars”
  – weapons guns: lower “weapons guns”, low “weaponsguns”, high “weapons and guns”
• Positively weighted and negatively weighted features work together.
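The count patterns behind these statistical features can be generated mechanically, as in the sketch below. Everything numeric here is invented: `TOY_COUNTS` stands in for search-engine counts, `count_fn` is a hypothetical lookup, and the log(count + 1) smoothing is an assumption added so that zero counts do not break the logarithm.

```python
import math


def count_patterns(x, y):
    """Strings whose counts feed the statistical features for the pair (x, y)."""
    return {
        "bigram": f"{x} {y}",
        "definite": f"the {x} {y}",
        "collapsed": f"{x}{y}",
        "and-count": f"{x} and {y}",
        "genitive": f"{x}'s {y}",
    }


def statistical_features(x, y, count_fn):
    """Log-scaled counts; the +1 smoothing is an illustrative assumption."""
    return {name: math.log(count_fn(p) + 1)
            for name, p in count_patterns(x, y).items()}


# Invented toy counts that mimic the pattern described on the slide.
TOY_COUNTS = {
    "star wars": 9000, "starwars": 800, "star and wars": 2,
    "weapons guns": 30, "weaponsguns": 0, "weapons and guns": 400,
}


def toy_count(pattern):
    return TOY_COUNTS.get(pattern, 0)


star = statistical_features("star", "wars", toy_count)
guns = statistical_features("weapons", "guns", toy_count)
print(star["collapsed"] > star["and-count"])  # evidence for one phrase
print(guns["and-count"] > guns["collapsed"])  # evidence for a break
```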

Slide 19

Summary of Feature Spans
[Diagram: tokens X1 X2 X3 X4 X5 X6 with the Boundary feature span marked]

Slide 20

Summary of Feature Spans
[Diagram: tokens X1 X2 X3 X4 X5 X6 with the Boundary span and four Context spans marked]

Slide 21

Summary of Feature Spans
[Diagram: tokens X1 X2 X3 X4 X5 X6 with the Boundary span, four Context spans and two Dependency spans marked]

Slide 22

Context Features
• Consider the segmentation decision between “loan” and “amortization” in: bank loan amortization schedule
• Might want to consider association of “bank” and “loan” as well.
• Get pairwise features with left and right neighbours, plus trigram, four-gram, and five-gram features, if available.
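A rough sketch of how the context spans around one decision point could be collected follows. It assumes that the relevant n-grams are all those of length 3 to 5 that cover both tokens at the boundary; the slides do not say exactly which n-grams were used, and the function and key names are hypothetical.

```python
def context_spans(tokens, i, max_n=5):
    """Token spans around the gap between tokens[i] and tokens[i+1]."""
    spans = {}
    if i - 1 >= 0:
        spans["left-pair"] = tuple(tokens[i - 1:i + 1])
    if i + 2 < len(tokens):
        spans["right-pair"] = tuple(tokens[i + 1:i + 3])
    ngrams = []
    for n in range(3, max_n + 1):
        # Keep only n-grams that contain both tokens[i] and tokens[i+1].
        first = max(0, i + 2 - n)
        last = min(i, len(tokens) - n)
        for start in range(first, last + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    spans["ngrams"] = ngrams
    return spans


query = "bank loan amortization schedule".split()
print(context_spans(query, 1))
```

For the gap between “loan” and “amortization” this yields the pairs (bank, loan) and (amortization, schedule), two trigrams and the single four-gram; no five-gram is available in a four-token query.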

Slide 23

Data and Experiments
• Use queries from AOL query database
  – Queries with a click-URL: indicates user’s intentions for the query
  – Only noun phrase queries: tagged determiners, adjectives and nouns
  – Only queries of length ≥ 4
  – 500 queries for training, 500 for development, 500 for testing

Slide 24

Data and Experiments
• Annotators asked to annotate queries to improve search precision
• Test set annotated by three annotators
• Agreement on segmentation decisions around 84%, lower than we expected
• More details in paper

Slide 25

Results
[Bar chart: Seg-Acc and Qry-Acc on a 0% to 80% axis for seven systems: MI (Boundary); Basic (Boundary, +Context, +Dependency); All (Boundary, +Context, +Dependency). Bar values are not recoverable from the transcript.]
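The two metrics named in the chart can be computed as in the sketch below. It assumes that Seg-Acc is the fraction of individual segmentation decisions that match the gold annotation and that Qry-Acc is the fraction of queries whose decisions are all correct; the slides do not define them, so treat both definitions, the function names and the toy labels as assumptions.

```python
def seg_acc(gold, pred):
    """Fraction of per-gap decisions that match the gold labels."""
    correct = total = 0
    for g_query, p_query in zip(gold, pred):
        for g, p in zip(g_query, p_query):
            correct += g == p
            total += 1
    return correct / total


def qry_acc(gold, pred):
    """Fraction of queries with every decision correct."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy labels (+1 = break, -1 = no break), invented for illustration.
gold = [[-1, 1, -1], [1, -1, -1]]
pred = [[-1, 1, -1], [1, 1, -1]]
print(seg_acc(gold, pred), qry_acc(gold, pred))
```

In this toy case five of six decisions are right but only one of two queries is fully correct, which shows why query-level accuracy is the stricter of the two numbers.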

Slide 26

Conclusion
• Proposed a new approach to query segmentation that allows richer features
• Reduces error by 56% over comparison system
• Future work: train query segmentation (or query expansion / contraction) to directly optimize information retrieval performance

Slide 27

Thanks

Slide 28

Dependency Features
• Consider the segmentation decision between “female” and “bus” in: female bus driver
• There is a stronger association between “female” and “driver” than “female” and “bus”, which might be useful
• Include features between pairs of tokens separated by a token.
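The skip-one pairs described above can be listed with a short helper. This is a sketch only: it assumes the two pairs of interest are the ones that straddle the decision point, and the function and key names are hypothetical.

```python
def dependency_pairs(tokens, i):
    """Pairs of tokens separated by one token, around the gap after tokens[i]."""
    pairs = {}
    if i + 2 < len(tokens):
        # x and the token after y, e.g. "female" ... "driver"
        pairs["x_skip_right"] = (tokens[i], tokens[i + 2])
    if i - 1 >= 0:
        # the token before x and y
        pairs["skip_left_y"] = (tokens[i - 1], tokens[i + 1])
    return pairs


query = "female bus driver".split()
print(dependency_pairs(query, 0))
print(dependency_pairs(query, 1))
```

Association statistics for each returned pair (for example the same count-based features used for adjacent tokens) could then be added to the feature vector for that decision.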