automatic suggestion of query-rewrite rules for enterprise search

Automatic Suggestion of Query-Rewrite Rules for Enterprise Search. Benny Kimelfeld (IBM Research – Almaden), Zhuowei Bao (University of Pennsylvania), Yunyao Li (IBM Research – Almaden). SIGIR 2012, Portland, Oregon, USA.

Upload: yunyao-li

Post on 18-Dec-2014


DESCRIPTION

Presented by Yunyao at SIGIR 2012

TRANSCRIPT

Page 1: Automatic suggestion of query-rewrite rules for enterprise search

Automatic Suggestion of Query-Rewrite Rules for Enterprise Search

Benny Kimelfeld, IBM Research – Almaden

Zhuowei Bao, University of Pennsylvania

Yunyao Li, IBM Research – Almaden

SIGIR 2012, Portland, Oregon, USA

Page 2: Automatic suggestion of query-rewrite rules for enterprise search

Challenges in Enterprise Search

• Continually changing terminology
– Product names change overnight: search "Network Station Manager" → Thin Client Manager

• Domain-specific meaning
– Search "Paula Summa" → bring Paula Summa from employee directories
– Search "popcorn" → conference call!

• Domain-specific redundancy
– Search "per diem":
– Result 1: IBM Travel: Per Diem
– Result 2: IBM Travel: Per Diem Rates
– Result 3: IBM Travel: National perdiems
– Result 25: IBM Travel: Per Diem Policy

An enterprise search engine is managed by admins who are domain experts. Not search experts!

Page 3: Automatic suggestion of query-rewrite rules for enterprise search

Programmable Search in IBM

• Programmable Search: A philosophy and design of enterprise search [Vaithyanathan, SIGIR’11]

– Backend analysis includes Information Extraction, categorization, and domain-specific variant generation [Zhu et al, WWW’07]

– Search programmable by runtime rules [Agarwal et al., WWW’10, Szpektor et al., WWW’11, Fagin et al., PODS’10/11]

• Gumshoe, a Programmable Search engine, today powering IBM internal & external portals

Background:
• IBM deployed traditional, black-box search solutions
• Quality degraded over time
• Exposed knobs were insufficient and opaque

Page 4: Automatic suggestion of query-rewrite rules for enterprise search

Engine Architecture & Rules

[Figure: engine architecture. Components: User, Search admin, Front end, Back end, Query rewriting, Result aggregation, Index, Runtime rules. Flow labels: query, set of new queries, ranked results, final results, report complaints.]

Runtime rules have the form: Query pattern → Action
• Query pattern: matched against the query
• Action: performed when the pattern matches

Rewrite Rules (focus here)
• Create new queries to augment/replace the original query

Grouping Rules
• Group results of specified categories
• Necessary because of domain-specific redundancy

Re-ranking Rules
• Re-rank results by specified categories
• Semantics-based “top” and “bottom” matches

Page 5: Automatic suggestion of query-rewrite rules for enterprise search

Rewrite Rules in Gumshoe

• EQUALS: $x [in PRODUCT] info → $x
– $x is in the product dictionary

• CONTAINS: lotus $x(presentations|spreadsheets) → lotus symphony
– $x matches a regex

• CONTAINS: msn search →+ bing
– Prefer the new query over old

• Similar to the query-template rules of Agarwal et al. [WWW 2010]

• The only type of rules considered here
• Abbreviation: s → t
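
As a rough illustration of the abbreviated form s → t, the sketch below applies a CONTAINS-style rule at token level. The function name `apply_contains_rule`, the whitespace tokenization, the lowercasing, and the choice to rewrite only the first occurrence are assumptions made for the example, not a description of Gumshoe's actual matching.

```python
from typing import Optional


def apply_contains_rule(query: str, s: str, t: str) -> Optional[str]:
    """Return the query with the first token-level occurrence of s replaced
    by t, or None when the query does not contain s (hypothetical helper)."""
    q_tokens = query.lower().split()
    s_tokens = s.lower().split()
    t_tokens = t.lower().split()
    n = len(s_tokens)
    if n == 0:
        return None  # an empty pattern matches nothing in this sketch
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == s_tokens:
            return " ".join(q_tokens[:i] + t_tokens + q_tokens[i + n:])
    return None


print(apply_contains_rule("excel spreadsheet", "spreadsheet", "symphony"))
# prints: excel symphony
```

Matching on tokens rather than raw substrings keeps a rule for "flu" from firing inside an unrelated longer word.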

Page 6: Automatic suggestion of query-rewrite rules for enterprise search

Aiding Administrators

Enterprise users:
• “Bad results for query …”
• “I’m missing the golden URL…”
• “Result 22 should be ranked much higher!”

Query logs:
• Query “global campus” seems unsatisfying

Search admin (this paper):
• Terminology mismatch: user queries vs. docs.
– Rule needed
• The rule should push desired results up to the top
• Devise, deploy, test
• How are other cases affected by my new rule?
– Revisit my old rules?

Page 7: Automatic suggestion of query-rewrite rules for enterprise search

Gumshoe Maintenance Toolkit

CIKM 2012 Demo [Bao et al.]

Page 8: Automatic suggestion of query-rewrite rules for enterprise search

Outline

• Introduction & Background

• Suggesting Natural Rules (single-rule setting)

• Optimizing Rule Selection (multiple-rule setting)

• Concluding Remarks

Page 9: Automatic suggestion of query-rewrite rules for enterprise search

Problem Description

• Input: Example (q,d) of a query and a desired match

• Goal: Devise an effective and natural rewrite rule
– Effective: push the desired match up to the top
– Natural: should correspond to a semantically coherent replacement of terms; the kind of rules the administrator would devise herself

Examples:
• seasonal flu → avian flu (temporarily correct)
• management change → SCIP (organization reconstruction)
• download → ISSI tool (main software access for IBMers)

Page 10: Automatic suggestion of query-rewrite rules for enterprise search

Algorithm

Input: Query q, desired match (doc/URL) d

• Candidates for s: n-grams of q
• Candidates for t: n-grams of high-quality fields of d
• Candidates for s → t: the cross product (×) of the two candidate sets

Output: Suggested rewrite rules s → t
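
The candidate-generation step can be sketched as follows. This is a minimal illustration under assumptions not stated on the slide: whitespace tokenization, lowercasing, a maximum n-gram length `max_n`, and dropping trivial candidates where s equals t. The names `ngrams` and `candidate_rules` are hypothetical.

```python
from itertools import product


def ngrams(text, max_n=3):
    """All word n-grams of text with 1 <= n <= max_n (max_n is an assumed cap)."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def candidate_rules(query, high_quality_fields, max_n=3):
    """Cross product of n-grams of the query (candidates for s) and n-grams
    of the document's high-quality fields (candidates for t)."""
    s_candidates = ngrams(query, max_n)
    t_candidates = set()
    for field in high_quality_fields:
        t_candidates |= ngrams(field, max_n)
    # Skip identity rules, which would rewrite a term to itself.
    return {(s, t) for s, t in product(s_candidates, t_candidates) if s != t}


rules = candidate_rules("seasonal flu", ["avian flu h5n1"])
print(("seasonal flu", "avian flu") in rules)
# prints: True
```

Even this tiny input yields 3 candidates for s and 6 for t, which hints at why the number of candidate rules grows quickly on real documents.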

Page 11: Automatic suggestion of query-rewrite rules for enterprise search

High-Quality Fields of a Document

• HTML title
• URL (fragments)
• Visual title

Page 12: Automatic suggestion of query-rewrite rules for enterprise search

But… Many Candidates

Candidates for the query "seasonal flu":
• seasonal flu → avian flu
• seasonal flu → h5n1
• seasonal flu → flu employee
• seasonal flu → and IBM h5n1
• seasonal → avian
• seasonal → h5n1
• seasonal → flu
• seasonal → you and IBM

Candidates for the query "change management":
• change management → strategy & change internal practice
• change management → welcome to strategy
• change management → welcome strategy
• change management → strategy change & internal
• change management → to strategy change internal practice
• management → internal management index pages
• management → Internal practice scip

We often get ≈ 100 candidates, sometimes ≈ 1000

Page 13: Automatic suggestion of query-rewrite rules for enterprise search

Algorithm

Input: Query q, desired match (doc/URL) d

• Candidates for s: n-grams of q
• Candidates for t: n-grams of high-quality fields of d
• Candidates for s → t: the cross product (×) of the two candidate sets
• Effectiveness filter
• Classifier: natural/unnatural rules (next)

Output: Suggested rewrite rules s → t

Page 14: Automatic suggestion of query-rewrite rules for enterprise search

Classification Features

Features of a rule s → t:

• Syntactic features
– Whether s (resp., t) begins with a stop word
– Whether s (resp., t) ends with a stop word
– Number of tokens in s (resp., t)

• Corpus statistics
– Logarithm of the frequency of s (resp., t)
– Logarithm of the co-occurrence frequency of s and t
– Logarithm of the frequency of s (resp., t) in titles

• Query-log statistics
– Logarithm of the s-to-t reformulation frequency
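
A possible way to turn the listed features into a feature vector is sketched below. The stop-word list, the add-one smoothing inside the logarithm, and the dictionary-based count tables (`corpus`, `cooccur`, `titles`, `reformulations`) are all assumptions for illustration; the slides do not say how the counts are stored or smoothed.

```python
import math

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}  # assumed list


def log_count(count):
    """Logarithm with add-one smoothing (an assumption) so zero counts are defined."""
    return math.log(1 + count)


def rule_features(s, t, corpus, cooccur, titles, reformulations):
    """Feature dictionary for a candidate rule s -> t (s and t non-empty).

    corpus / titles map a phrase to its frequency; cooccur / reformulations
    map an (s, t) pair to a frequency. All four tables are hypothetical inputs.
    """
    s_tok, t_tok = s.split(), t.split()
    return {
        # Syntactic features
        "s_begins_stop": float(s_tok[0] in STOP_WORDS),
        "t_begins_stop": float(t_tok[0] in STOP_WORDS),
        "s_ends_stop": float(s_tok[-1] in STOP_WORDS),
        "t_ends_stop": float(t_tok[-1] in STOP_WORDS),
        "s_tokens": float(len(s_tok)),
        "t_tokens": float(len(t_tok)),
        # Corpus statistics
        "log_freq_s": log_count(corpus.get(s, 0)),
        "log_freq_t": log_count(corpus.get(t, 0)),
        "log_cooccur": log_count(cooccur.get((s, t), 0)),
        "log_title_s": log_count(titles.get(s, 0)),
        "log_title_t": log_count(titles.get(t, 0)),
        # Query-log statistics
        "log_reform": log_count(reformulations.get((s, t), 0)),
    }


features = rule_features(
    "seasonal flu", "the avian flu",
    corpus={"seasonal flu": 9}, cooccur={}, titles={},
    reformulations={("seasonal flu", "the avian flu"): 3},
)
print(features["t_begins_stop"], features["s_tokens"])
# prints: 1.0 2.0
```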

Page 15: Automatic suggestion of query-rewrite rules for enterprise search

Classification Models

• We take an approach similar to Kraft & Zien [2004] that explored a problem of a similar flavor

• SVM: a linear classifier

• rDTLC: Decision Tree with Linear-Combination splits [Loh & Shih,1988]
– Bound the tree depth (3 in our implementation)
– Use univariate splits on non-leaf nodes

Tree structure (depth 3):
Level 1: f_{i0} < τ0 ?
Level 2: f_{i1} < τ1 ? | f_{i2} < τ2 ?
Level 3: ∑ a_i f_i < τ3 ? | ∑ b_i f_i < τ4 ? | ∑ c_i f_i < τ5 ? | ∑ d_i f_i < τ6 ?
Each level-3 test has a Yes branch and a No branch.
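
To make the tree shape concrete, the sketch below evaluates a depth-3 tree with univariate tests on the two upper levels and linear-combination tests on the third. It only shows how such a tree would be applied to a feature vector; how the splits are learned is not covered, and the class names, the left-to-right leaf numbering, and the example weights are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Univariate:
    index: int        # which feature f_i is tested
    threshold: float  # tau

    def test(self, f: Sequence[float]) -> bool:
        return f[self.index] < self.threshold


@dataclass
class Linear:
    weights: Sequence[float]  # coefficients of the linear combination
    threshold: float          # tau

    def test(self, f: Sequence[float]) -> bool:
        return sum(w * x for w, x in zip(self.weights, f)) < self.threshold


def classify(f, root, level2, level3, leaves):
    """Walk the tree: 1 root test, 2 level-2 tests, 4 level-3 tests, 8 leaves.

    A passed test ("Yes") goes to the left child, a failed one to the right.
    leaves[k] is the label stored at the k-th leaf, counted left to right.
    """
    i = 0 if root.test(f) else 1
    j = 2 * i + (0 if level2[i].test(f) else 1)
    k = 2 * j + (0 if level3[j].test(f) else 1)
    return leaves[k]


root = Univariate(index=0, threshold=0.5)
level2 = [Univariate(1, 0.5), Univariate(1, 0.5)]
level3 = [Linear([1.0, 1.0], 1.0) for _ in range(4)]
leaves = [False] * 8
leaves[3] = True  # arbitrary example labelling

print(classify([0.2, 0.9], root, level2, level3, leaves))
# prints: True
```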

Page 16: Automatic suggestion of query-rewrite rules for enterprise search

Experimental Setting

• Experiments over IBM Intranet search

• 1894 suggested matches (q,d) provided by IBM CIO Office
– These are usually matches for highly frequent queries
– 11907 effective candidate rules generated
– “Effective” = pushes d from outside top k for q to inside

• Manually labeled ~1200 candidate rewrite rules
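
The definition of "effective" can be read as a simple predicate. The sketch below assumes a hypothetical `search` callable that returns a ranked list of document identifiers, models the rule's effect by searching the rewritten query alone, and uses an arbitrary `k`; none of these details come from the slides.

```python
def rewrite(query, s, t):
    """Replace the first token-boundary occurrence of s by t; None if s is absent."""
    padded = f" {query} "
    if f" {s} " not in padded:
        return None
    return padded.replace(f" {s} ", f" {t} ", 1).strip()


def is_effective(query, doc, s, t, search, k=5):
    """True when doc is outside the top k for the original query but inside
    the top k once the rule s -> t has been applied (assumed reading)."""
    rewritten = rewrite(query, s, t)
    if rewritten is None:
        return False  # the rule does not fire on this query
    before = search(query)[:k]
    after = search(rewritten)[:k]
    return doc not in before and doc in after


# Toy index standing in for a real search engine.
toy_index = {
    "seasonal flu": ["d1", "d2", "d9"],
    "avian flu": ["d9", "d1"],
}


def toy_search(q):
    return toy_index.get(q, [])


print(is_effective("seasonal flu", "d9", "seasonal flu", "avian flu", toy_search, k=2))
# prints: True
```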

Page 17: Automatic suggestion of query-rewrite rules for enterprise search

Experimental Results

[Charts: Classification Accuracy (axis 60 to 100); Ranking by Classifier Score (MRR) (axis 0.2 to 1); Ranking by Classifier Score (nDCGk) at top-1, top-3, top-5 (axis 0.2 to 1). Legend labels: SVM, DecisionTree/SVM, Rules weighted by query frequency, + weighted training, Random.]
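
For readers unfamiliar with the two ranking measures named on the charts, here is a minimal sketch of their standard definitions. It assumes each ranked list is given as relevance values in rank order (1 for a natural rule, 0 otherwise in the binary case); how the evaluation in the paper assigned relevance is not shown on the slide.

```python
import math


def mrr(ranked_lists):
    """Mean reciprocal rank: average of 1/rank of the first relevant item
    (a list with no relevant item contributes 0)."""
    total = 0.0
    for rels in ranked_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


print(mrr([[0, 1, 0], [1, 0, 0]]))
# prints: 0.75
```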

Page 18: Automatic suggestion of query-rewrite rules for enterprise search

Outline

• Introduction & Background

• Suggesting Natural Rules

• Optimizing Rule Selection

• Concluding Remarks

Page 19: Automatic suggestion of query-rewrite rules for enterprise search

Optimizing Rule Selection

Rules:
• stock → stock market
• spreadsheet → symphony

Example queries and their rewrites:
• spreadsheet tutorial → symphony tutorial
• stock spreadsheet → stock market spreadsheet, stock symphony
• excel spreadsheet → excel symphony

One of the rewrites in the figure is marked “Negative effect”.

• A rule can negatively affect performance on desired matches

• A rule can interfere with other rules

• Idea: Optimize rule selection

Page 20: Automatic suggestion of query-rewrite rules for enterprise search

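Formal Optimization Problem

[Figure: layered graph. Queries q1, q2, q3, ..., qn connect to rewritten queries p1, p2, p3, ..., pn through edges labeled with rewrite rules (r1, r2, r3, r4, r6, r8, r9); rewritten queries connect to documents through edges labeled with scores (s1, ..., s6).]

The symbols for the desired matches and for the quality measure did not survive extraction; D and μ are used as stand-ins here.

• D(q_i): desired doc. matches for q_i
• top_k(q_i): the k docs. reachable with highest score
• μ(D(q_i), top_k(q_i)): quality measure per q_i

Goal: Find a subset of the rewrite rules that maximizes

$$\sum_{i=1}^{n} \mu\big(D(q_i),\ \mathrm{top}_k(q_i)\big)$$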
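
The objective for a fixed rule set can be sketched as below. Several modelling choices are assumptions made only to have something executable: the original query stays reachable alongside its rewrites, each rule is applied at most once per query, a document reachable through several rewritten queries keeps its highest score, and `scores` is a hypothetical table from rewritten query to per-document scores.

```python
def rewrite(query, s, t):
    """Replace the first token-boundary occurrence of s by t; None if s is absent."""
    padded = f" {query} "
    if f" {s} " not in padded:
        return None
    return padded.replace(f" {s} ", f" {t} ", 1).strip()


def top_k(query, rules, scores, k):
    """The k documents reachable from query with the highest score."""
    reachable = {query}
    for s, t in rules:
        p = rewrite(query, s, t)
        if p is not None:
            reachable.add(p)
    best = {}
    for p in reachable:
        for doc, score in scores.get(p, {}).items():
            best[doc] = max(score, best.get(doc, float("-inf")))
    # Ties are broken by document id to keep the result deterministic.
    return sorted(best, key=lambda d: (-best[d], d))[:k]


def objective(rules, benchmark, scores, k, measure):
    """Sum over benchmark queries of measure(desired matches, top-k docs)."""
    return sum(
        measure(desired, top_k(q, rules, scores, k))
        for q, desired in benchmark.items()
    )


def hit(desired, top):
    """Toy quality measure: 1 if any desired document is in the top k."""
    return 1.0 if desired & set(top) else 0.0


scores = {
    "excel spreadsheet": {"d_excel": 2.0},
    "excel symphony": {"d_sym": 3.0},
}
benchmark = {"excel spreadsheet": {"d_excel"}}

print(objective([], benchmark, scores, 1, hit))
print(objective([("spreadsheet", "symphony")], benchmark, scores, 1, hit))
# prints: 1.0 then 0.0
```

In this toy data, adding the rule spreadsheet → symphony lowers the objective, which is the kind of negative effect that motivates selecting a subset of rules.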

Page 21: Automatic suggestion of query-rewrite rules for enterprise search

Hardness & Heuristics

Theorem:
• Finding an optimal set of rules is NP-hard
• So is finding any constant-factor approx.
• Holds already for k=1
• Holds for every quality measure (e.g., DCG, precision@k, etc.), assuming a very basic well-behavior property
• Reduction from maximum independent set

We propose 2 simple heuristic algorithms.

Page 22: Automatic suggestion of query-rewrite rules for enterprise search

Greedy Algorithms (High Level)

Globally Greedy Algorithm
R ← empty set
While(change) {
  r ← rule w/ max quality for R+{r}
  If(R+{r} is better than R) {
    R ← R+{r}
  }
}
Return R

Locally Greedy Algorithm
R ← empty set
For all benchmark pairs (q,d) {
  REL ← the rules relevant to (q,d)
  r ← rule w/ max quality for R+{r} among REL
  If(R+{r} is better than R) {
    R ← R+{r}
  }
}
Return R

In the paper: weighted versions + running-time optimizations
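
The two pseudocode listings above can be sketched in Python as follows. This is an unweighted, unoptimized reading: `quality` is a hypothetical callable scoring a set of rules, `relevant` is a hypothetical callable returning the rules relevant to a benchmark pair, and ties are broken by input order.

```python
def globally_greedy(rules, quality):
    """Repeatedly add the single rule that most improves quality, until no
    remaining rule gives a strict improvement."""
    selected = set()
    best = quality(selected)
    while True:
        candidates = [r for r in rules if r not in selected]
        if not candidates:
            break
        r = max(candidates, key=lambda c: quality(selected | {c}))
        q = quality(selected | {r})
        if q <= best:
            break  # no change: stop
        selected.add(r)
        best = q
    return selected


def locally_greedy(benchmark, relevant, quality):
    """One pass over the benchmark pairs; for each pair, consider only the
    rules relevant to it and keep the best one if it improves quality."""
    selected = set()
    best = quality(selected)
    for pair in benchmark:
        candidates = [r for r in relevant(pair) if r not in selected]
        if not candidates:
            continue
        r = max(candidates, key=lambda c: quality(selected | {c}))
        q = quality(selected | {r})
        if q > best:
            selected.add(r)
            best = q
    return selected


def toy_quality(rule_set):
    """Toy measure: rules 'a' and 'b' help, rule 'c' hurts."""
    return len(rule_set & {"a", "b"}) - 2 * len(rule_set & {"c"})


print(sorted(globally_greedy(["a", "b", "c"], toy_quality)))
# prints: ['a', 'b']
```

The local variant evaluates far fewer candidate sets per step, since it looks only at rules relevant to one benchmark pair at a time.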

Page 23: Automatic suggestion of query-rewrite rules for enterprise search

Experiments over IBM Intranet Search

[Charts: Achieved accuracy for nDCGk (unweighted) at top-1, top-3, top-5; Achieved accuracy for MRR (weighted) as benchmark is added; axes from 0.4 to 1. Legend labels: All Rules, Random (L), Random (G), Greedy (L), Greedy (G), Bound; one group is labeled Baselines.]

More in the paper:

• Additional combinations measure+weight+data

• Large number of rules

• Running times

Local is 2 orders of magnitude faster than Global

Page 24: Automatic suggestion of query-rewrite rules for enterprise search

Outline

• Introduction & Background

• Suggesting Natural Rules

• Optimizing Rule Selection

• Concluding Remarks

Page 25: Automatic suggestion of query-rewrite rules for enterprise search

Summary and Future Work

• In programmable search, domain knowledge of the enterprise is introduced by means of rules

• Studied 2 problems of facilitating rule management
– Suggesting natural rules: candidate generation, classifier for identifying natural rules
– Optimizing rule selection: unfortunately, the problem quickly gets NP-hard; presented simple heuristics + optimizations thereof

• Conducted experiments over real data from IBM Intranet search, provided by IBM search administrators

• CIKM 2012 demo

• Various challenges remain for future work
– Improving the quality and efficiency of rule suggestion (in particular, indexing, “learning to rank”)
– Extending the framework into a richer class of rules (using dictionaries, regular expressions, etc.)

Questions?