Recent and Robust Query Auto-Completion
WWW 2014 conference presentation - TRANSCRIPT
Stewart Whiting and Joemon M. Jose
University of Glasgow, Scotland, UK
Auto-completion is a common feature in text (e.g. search) interfaces - known as query auto-completion (QAC).
As a query is typed, the user is shown possible completions.
Query Auto-Completion
Introduction | Motivation | Approaches | Experimentation | Results | Conclusions
Why?
Typing queries is hard (especially ‘good’ queries).
QAC minimises physical and cognitive effort during query input.
QAC is a prominent/high-impact feature - performance is very noticeable.
Definitions
A user-typed prefix of several characters.
N ranked completion suggestions, composed of (previously seen) completed queries matching the input prefix.
QAC Research
EFFICIENCY - a large body of work
Several engineering challenges:
- Trie-based data structures
- Instantaneous response time
- Typo resilience
- Term re-ordering
- Reduced memory storage complexity
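The efficiency work above centres on trie-based prefix lookup. A minimal Python sketch (illustrative only, not any of the cited systems; production QAC tries typically cache top-k completions per node for instantaneous response):

```python
# Minimal trie for prefix completion: count query terminations at each
# node, then rank completions under a prefix by count.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # times a query terminated at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def complete(self, prefix, k=4):
        # Walk down to the prefix node, then collect completions by count.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, suffix):
            if n.count:
                results.append((prefix + suffix, n.count))
            for ch, child in n.children.items():
                collect(child, suffix + ch)
        collect(node, "")
        results.sort(key=lambda x: -x[1])
        return [q for q, _ in results[:k]]
```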
EFFECTIVENESS - relatively little attention (especially outside of industry)
- Improving completion suggestion ranking
- Well-known simple approaches
- Incorporating context features (time, location, personalisation)
- Few scientific comparisons (hindered by lack of open query logs and reproducible baseline results)
QAC Effectiveness Goal
To be effective, QAC must suggest the user’s intended query after the fewest possible keystrokes.
Which means… QAC must guess what the user will type.
… not trivial! There is a huge space of possible completions.
Query popularity distribution constantly changes
over time
Temporal factors, emerging/ongoing events and
phenomena
Changing Query Distribution
Completing prefix ‘k’ at Google on 2013-09-23: Westgate Mall Attack, Kenya.
Predictably popular queries: always popular, seasonal (Christmas etc.), foreseeable (TV/known events etc.).
Unpredictably popular queries: breaking news, events and phenomena.
20% of Google queries haven’t been seen in the last 90 days (many long-tail).
QAC must support all these queries!
Role of Time in QAC
TRADE-OFF
Completion suggestions must include both consistently and recently popular queries.
Recency: time sensitivity to new, previously unseen or unpopular queries.
Robustness: reliable ranking of always-popular queries.
Opposing objectives - can a trade-off be reached?
Motivations
Time-series modelling [Shokouhi, SIGIR 2012] has several issues:
- Not all events are on a constant schedule (e.g. Easter, public holidays)
- Lag and over-fitting prove problematic in time series
- However, our approach is complementary
Large-scale news, events and world phenomena play a central role in search behaviour… and hence in the likelihood of a user typing a query.
Many organisations don’t have years of query logs:
- Need to take a more short-term approach
- Rely on identifying short-range query popularity using recently observed trends
Naively relying on long-term query popularity smooths away recent query trends.
Yet relying on only short-term query popularity makes QAC susceptible to random fluctuations.
…the trade-off.
Ranked completion suggestions composed of past queries are provided for a prefix at time qt.
Only evidence prior to qt is available for ranking (the real-time constraint).
QAC Approaches
e.g. prefix ‘th’ → ‘the world…’, ‘think…’; prefix ‘so’ → ‘south…’, ‘sogou’
Two distinct QAC ranking approaches:
1. Assume current query distribution is same as previously observed
2. Predict current query distribution based on trends
Max. Likelihood (MLE-ALL)
Maximum Likelihood Estimate (MLE): based on past evidence, which query is the user probabilistically most likely to type?
A common approach to QAC - ‘MostPopularQuery’ [Bar-Yossef 2010, Shokouhi 2012]
Soft Baseline
Use all available query popularity
evidence (from query log) prior to qt
to measure P(q)
‘so’ ranking (by popularity):
70: Southwest
58: South America
30: South Korea
10: Southern Fried Chicken
…
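A minimal sketch of the MLE-ALL ranker (an illustrative interface, not the authors' implementation):

```python
from collections import Counter

class MleAll:
    """MLE-ALL sketch: rank completions for a prefix by their frequency
    over the entire query log observed so far (all evidence prior to qt)."""
    def __init__(self):
        self.counts = Counter()

    def observe(self, query):
        # Every completed query adds evidence, and it never expires.
        self.counts[query] += 1

    def suggest(self, prefix, k=4):
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        matches.sort(key=lambda x: -x[1])
        return [q for q, _ in matches[:k]]
```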
Max. Likelihood (MLE-W)
As MLE-ALL…
But, use only recent query log evidence
2, 4, 7, 14, 28 day sliding window of
query log to compute P(q)
How long is enough?
Hard Baseline
(with optimal sliding window)
Only short-term data is available - so we are unable to use the [Shokouhi 2012] time-series modelling baseline.
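The sliding-window variant can be sketched by expiring evidence older than the window (timestamps and window handling here are illustrative assumptions):

```python
from collections import Counter, deque

class MleW:
    """MLE-W sketch: as MLE-ALL, but P(q) is estimated only from queries
    observed within a sliding time window of `window_days`."""
    def __init__(self, window_days=7):
        self.window = window_days * 86400  # window size in seconds
        self.log = deque()                 # (timestamp, query) in arrival order
        self.counts = Counter()

    def observe(self, query, ts):
        self.log.append((ts, query))
        self.counts[query] += 1
        # Expire evidence that has fallen out of the window.
        while self.log and self.log[0][0] < ts - self.window:
            _, old = self.log.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def suggest(self, prefix, k=4):
        matches = sorted(((q, c) for q, c in self.counts.items()
                          if q.startswith(prefix)), key=lambda x: -x[1])
        return [q for q, _ in matches[:k]]
```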
Last N Queries (LNQ)
Prefix popularity is not uniform (long-tail distribution)
Imposing a strict last-N-day window gives…
… too much evidence for popular prefixes
… too little evidence for unpopular prefixes
Use the last N queries observed with the prefix (+ flood control).
How many past queries? We experiment with N = 100, 200, 400, 800, 1200.
A very light-weight, practical approach - implemented using a double-ended queue (‘deque’) data structure.
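A minimal LNQ sketch (prefix length and flood control simplified; an illustration of the idea, not the authors' code):

```python
from collections import Counter, deque, defaultdict

class Lnq:
    """LNQ sketch: keep only the last N queries observed *per prefix*,
    so popular prefixes expire evidence quickly while rare prefixes
    retain evidence for longer."""
    def __init__(self, n=200, prefix_len=2):
        self.n = n
        self.prefix_len = prefix_len
        self.recent = defaultdict(deque)  # prefix -> last N queries

    def observe(self, query):
        dq = self.recent[query[:self.prefix_len]]
        dq.append(query)
        if len(dq) > self.n:
            dq.popleft()  # oldest query falls out of the evidence window

    def suggest(self, prefix, k=4):
        counts = Counter(q for q in self.recent[prefix[:self.prefix_len]]
                         if q.startswith(prefix))
        return [q for q, _ in counts.most_common(k)]
```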
Predicted Next N Queries (PNQ)
Don’t rely passively on past distribution -
predict short-range query popularity
- i.e. query occurrence in next 200 queries
Use only recent trends
Where ‘recent’ is based on the prefix popularity
Avoiding over-fitting and lag
Model Variables: Multiple (M) LNQs to track recent query popularity in ‘windows’
Predictive model: per-query counts across windows, e.g. q1 = {1,1,2,1}, q2 = {4,8,10,16} …
4x200 (4x LNQ, where N = 200)
Model training: stochastic gradient descent (SGD - fast!) to incrementally fit linear regression model parameters online, following every 200 queries observed with the prefix.
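The regression step can be sketched as follows (the model form and learning rate are illustrative assumptions, not the authors' exact configuration):

```python
class PnqModel:
    """PNQ sketch: map a query's counts in the last M LNQ windows to its
    predicted count in the next window, fitted online by SGD."""
    def __init__(self, m=4, lr=1e-3):
        self.m = m            # number of LNQ windows used as features
        self.lr = lr          # SGD learning rate (assumed value)
        self.w = [0.0] * m    # one weight per window count
        self.b = 0.0          # intercept

    def predict(self, history):
        # history: the query's counts in the last M windows, oldest first.
        return self.b + sum(wi * xi for wi, xi in zip(self.w, history))

    def sgd_step(self, history, actual_next):
        # One stochastic gradient step on squared prediction error,
        # taken after each completed window of observed queries.
        err = self.predict(history) - actual_next
        for i in range(self.m):
            self.w[i] -= self.lr * err * history[i]
        self.b -= self.lr * err
```

Completions for a prefix would then be ranked by predicted next-window count rather than raw past popularity.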
MLE-W (sliding window) / LNQ (Last N Queries) / PNQ (Predicted Next N Queries) approaches all have a parameter
We can train/test a single parameter on average over a collection.
Or, allow the parameter for different prefixes to change over time.
Online Learning (O-…)
A ‘meta-approach’
Feedback framework to select the optimal parameter online: rank completion suggestions using the best-performing parameter (by MRR) over the last Δ queries for each prefix.
Δ = 100, 300, 600 queries.
QAC provides instant feedback… we know the correct completion suggestion soon after our ranking.
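The meta-approach can be sketched as a selector over per-parameter rankers (the `suggest`/`observe` ranker interface is an assumption for illustration):

```python
from collections import deque

class OnlineSelector:
    """Online-learning sketch: run one ranker per candidate parameter,
    score each by reciprocal rank over the last delta queries, and serve
    suggestions from the currently best-scoring parameter."""
    def __init__(self, rankers, delta=300):
        self.rankers = rankers  # parameter value -> ranker object
        self.rr = {p: deque(maxlen=delta) for p in rankers}

    def _score(self, p):
        # Mean reciprocal rank over the feedback horizon (0 if none yet).
        return sum(self.rr[p]) / len(self.rr[p]) if self.rr[p] else 0.0

    def suggest(self, prefix, k=4):
        best = max(self.rankers, key=self._score)
        return self.rankers[best].suggest(prefix, k)

    def feedback(self, prefix, true_query, k=4):
        # The typed query is known shortly after ranking: score every
        # candidate parameter's ranking, then let each ranker learn.
        for p, r in self.rankers.items():
            ranked = r.suggest(prefix, k)
            rr = 1.0 / (ranked.index(true_query) + 1) if true_query in ranked else 0.0
            self.rr[p].append(rr)
            r.observe(true_query)
```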
Real-time QAC Simulation
Simulate a real-time scenario for experiments: step through real queries from a real query log, providing 4 completion suggestions.
Using ground-truth real query logs (AOL/MSN/Sogou); extracted ‘typed’ queries from the logs.
Measure QAC effectiveness using MRR - insensitive over millions of queries, so we need to consider even small changes.
Real user behaviour: QAC could modify user behaviour (possible bias); literal vs. semantic matching (‘AA’ or ‘American Airlines’).
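MRR, the effectiveness measure used throughout, can be computed as:

```python
def reciprocal_rank(suggestions, true_query):
    # 1/rank of the query the user actually typed, or 0 if not suggested.
    for rank, q in enumerate(suggestions, start=1):
        if q == true_query:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(trials):
    # trials: iterable of (ranked suggestion list, actually typed query).
    rrs = [reciprocal_rank(s, t) for s, t in trials]
    return sum(rrs) / len(rrs)
```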
Simulation Art (Thomas Briggs - salientimages.com)
The logs differ in language, query length and query volume.
Experimental Query Logs
                   AOL             MSN          Sogou
Language           English (US)    English (US) Chinese (Simplified)
Period             March-May 2006  May 2006     June 2008
Typed Queries      18.1M           11.9M        25.1M
Avg. per Hour      11.8K           29K          65K
Avg. per Day       196K            383K         837K
Avg. Query Length  17 chars        17.4 chars   6.5 chars
Std. dev.          11              12           4
3 diverse open query logs for experiments - time is language independent.
Cold-start ‘learning’ period: first 14 days of MSN/Sogou, first 28 days of AOL.
After these initial learning periods, we compare performance over the same periods.
MLE-ALL (soft baseline)
MLE-W (hard baseline)
LNQ
PNQ
O-… (online learning)
Compare all approaches overall
Other than MLE-ALL/W, results are reported as relative change (±%) to MLE-W (hard baseline).
Results
All results are statistically significant (millions of queries in each query log) - so we rely on effect size instead.
Baseline Approach: MLE-ALL
QAC performance varies considerably between systems.
Short prefixes do relatively poorly.
MSN > AOL at short prefixes.
Sogou far better (expectedly - Chinese vs. Latin alphabet).
AOL sometimes responds differently/with less effect (sampling/breadth issues) - see the paper for AOL results.
MLE-ALL MRR by prefix length (reconstructed chart data; series assignment inferred from the observations above):

Prefix length (chars)  2       3       4       5
MSN                    0.1124  0.1702  0.2106  0.2340
AOL                    0.0962  0.1527  0.1969  0.2304
Sogou                  0.4117  0.5487  0.5970  0.6129
Baseline Approach: MLE-W
[Figure: MRR change vs. sliding window size (2, 4, 7, 14 days), panels for MSN and Sogou, series for 2-5 character prefixes. Annotated points - MSN: +2.95%, +2.28%, +0.68%; Sogou: +3.59%, +3.36%, +3.03%.]
Using a sliding window has a considerable effect
Generally, for a small prefix, a small window is best.
We don't reach the optimal window for >2 character prefixes - but a pattern is emerging.
Using only recent evidence can improve QAC - but it can also harm.
A clear relationship is emerging between time and performance.
Approach: LNQ
[Figure: MRR change vs. last N queries (N = 100, 200, 400, 800, 1200), panels for MSN and Sogou, series for 2 and 3 character prefixes. Annotated points - MSN: +3.27%, +3.82%, +3.92% (2 chars), +1.92%, +2.16%, +2.05% (3 chars); Sogou: +4.58%, +4.88%, +4.65% (2 chars), +3.17%, +3.21%, +3.03% (3 chars).]
LNQ supports an optimal trade-off between recency and robustness
Practical technique leads to consistent QAC performance gains
Substantial gains for MSN and Sogou
Larger N (= 1200) is best for MSN; smaller N (= 200) for Sogou - and Sogou is much less sensitive to N.
Approach: PNQ
[Figure: MRR change vs. MxLNQ regression model configuration (10x50, 5x100, 20x50, 10x100, 5x200), panels for MSN and Sogou, series for 2 and 3 character prefixes. Annotated points - MSN: +3.86%, +3.95%, +4.06% (2 chars), +1.96%, +2.02%, +2.11% (3 chars); Sogou: +5.14%, +5.13%, +5.11% (2 chars), +3.38%, +3.36%, +3.35% (3 chars).]
Small improvement over optimal LNQ
Less sensitive to parameters compared to LNQ
Despite high predictive accuracy, PNQ barely improves QAC performance (discussed in paper)
Prediction overhead offers only marginal improvements over best LNQ parameters
Further work: more elaborate (non-linear) short range prediction approaches
Approach: Online Learning (O-…)
[Figure: MRR change vs. Δ (learn over past 100, 300, 600 queries) for O-MLE-W, O-LNQ and O-PNQ, 2 character prefixes only. Annotated points - MSN: +3.45%, +4.11%, +3.94%; Sogou: +3.83%, +5.42%, +5.28%.]
Performance varies for MSN across approaches and learning horizon.
Little sensitivity to delta in Sogou
O-LNQ performs optimally
Parameters are also time-sensitive
Online learning approach yields highest QAC effectiveness - for all query logs (inc. AOL)
Side-by-side: best performing approaches/parameters
All Approaches Compared
% MRR change over soft (non-temporal) baseline (MLE-ALL):

Prefix    MLE-W    LNQ      PNQ      O-LNQ (best online)
MSN
2 chars   +2.95%   +6.98%   +7.13%   +7.18%
3 chars   +0.27%   +2.44%   +2.39%   +2.71%
Sogou
2 chars   +3.59%   +8.64%   +8.91%   +9.21%
3 chars   -0.19%   +3.02%   +3.18%   +3.32%
Relatively small differences between approaches
Higher gains for 2 character prefix
Optimised simple LNQ works well in all cases
Online learnt O-LNQ provides state-of-the-art baseline
Research Conclusions
Recency is an important part of QAC - but so too is robustness.
These opposing objectives must be considered simultaneously
Proposed several approaches
Comprehensive experiments for 3 query logs
All approaches work (statistically significantly) - LNQ with online learning is most effective, with up to +9.2% MRR improvement.
Ongoing work to improve short-range query popularity prediction and incorporate context features
1. Several light-weight and practical approaches for QAC research + implementation
2. Optimised simple techniques work well!
3. Time in QAC is language-independent
4. No approach is best in all scenarios – choose carefully
Take-Aways
Approach implementation and simulation code (C#) is available on GitHub for further work
All results easily reproducible – using open query logs + code
@stewhir | http://www.stewh.com