Recent and Robust Query Auto-Completion
WWW 2014 conference presentation - TRANSCRIPT
Stewart Whiting and Joemon M. Jose
University of Glasgow, Scotland, UK
Auto-completion is a common feature in text (e.g. search) interfaces - known as query auto-completion (QAC).
As a query is typed, the user is shown possible completions.
Query Auto-Completion
Introduction | Motivation | Approaches | Experimentation | Results | Conclusions
Why?
Typing queries is hard (especially ‘good’ queries).
QAC minimises physical and cognitive effort during query input.
QAC is a prominent/high-impact feature - performance is very noticeable.
Definitions
A user-typed prefix of several characters.
N ranked completion suggestions, composed of (previously seen) completed queries matching the input prefix.
QAC Research
EFFICIENCY - a large body of work
Several engineering challenges:
- Trie-based data structures
- Instantaneous response time
- Typo resilience
- Term re-ordering
- Reduced memory storage complexity
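The efficiency work above centres on trie-based prefix lookup. A minimal Python sketch (illustrative only, not any of the cited systems; production QAC tries typically cache top-k completions per node for instantaneous response):

```python
# Minimal trie for prefix completion: count query terminations at each
# node, then rank completions under a prefix by count.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # times a query terminated at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def complete(self, prefix, k=4):
        # Walk down to the prefix node, then collect completions by count.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, suffix):
            if n.count:
                results.append((prefix + suffix, n.count))
            for ch, child in n.children.items():
                collect(child, suffix + ch)
        collect(node, "")
        results.sort(key=lambda x: -x[1])
        return [q for q, _ in results[:k]]
```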
EFFECTIVENESS - relatively little attention (especially outside of industry)
- Improving completion suggestion ranking
- Well-known simple approaches
- Incorporating context features (time, location, personalisation)
- Few scientific comparisons (hindered by lack of open query logs and reproducible baseline results)
QAC Effectiveness Goal
To be effective, QAC must suggest the user’s intended query after the fewest possible keystrokes.
Which means… QAC must guess what the user will type.
… not trivial! There is a huge space of possible completions.
Query popularity distribution constantly changes
over time
Temporal factors, emerging/ongoing events and
phenomena
Changing Query Distribution
Completing prefix ‘k’ at Google on 2013-09-23: Westgate Mall Attack, Kenya.
Predictably popular queries: always popular, seasonal (Christmas etc.), foreseeable (TV/known events etc.).
Unpredictably popular queries: breaking news, events and phenomena.
20% of Google queries haven’t been seen in the last 90 days (many long-tail).
QAC must support all these queries!
Role of Time in QAC
TRADE-OFF
Completion suggestions must include both consistently and recently popular queries.
Recency: time sensitivity to new, previously unseen or unpopular queries.
Robustness: reliable ranking of always-popular queries.
Opposing objectives - can a trade-off be reached?
Motivations
Time-series modelling [Shokouhi, SIGIR 2012] has several issues:
- Not all events are on a constant schedule (e.g. Easter, public holidays)
- Lag and over-fitting prove problematic in time series
- However, our approach is complementary
Large-scale news, events and world phenomena play a central role in search behaviour… and hence in the likelihood of a user typing a query.
Many organisations don’t have years of query logs:
- Need to take a more short-term approach
- Rely on identifying short-range query popularity using recently observed trends
Naively relying on long-term query popularity smooths away recent query trends.
Yet relying on only short-term query popularity makes QAC susceptible to random fluctuations.
…the trade-off.
Ranked completion suggestions composed of past queries are provided for a prefix at time qt.
Only evidence prior to qt is available for ranking (the real-time constraint).
QAC Approaches
e.g. prefix ‘th’ → ‘the world…’, ‘think…’; prefix ‘so’ → ‘south…’, ‘sogou’
Two distinct QAC ranking approaches:
1. Assume current query distribution is same as previously observed
2. Predict current query distribution based on trends
Max. Likelihood (MLE-ALL)
Maximum Likelihood Estimate (MLE): based on past evidence, which query is the user probabilistically most likely to type?
A common approach to QAC - ‘MostPopularQuery’ [Bar-Yossef 2010, Shokouhi 2012]
Soft Baseline
Use all available query popularity
evidence (from query log) prior to qt
to measure P(q)
‘so’ ranking (by popularity):
70: Southwest
58: South America
30: South Korea
10: Southern Fried Chicken
…
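A minimal sketch of the MLE-ALL ranker (an illustrative interface, not the authors' implementation):

```python
from collections import Counter

class MleAll:
    """MLE-ALL sketch: rank completions for a prefix by their frequency
    over the entire query log observed so far (all evidence prior to qt)."""
    def __init__(self):
        self.counts = Counter()

    def observe(self, query):
        # Every completed query adds evidence, and it never expires.
        self.counts[query] += 1

    def suggest(self, prefix, k=4):
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        matches.sort(key=lambda x: -x[1])
        return [q for q, _ in matches[:k]]
```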
Max. Likelihood (MLE-W)
As MLE-ALL…
But, use only recent query log evidence
2, 4, 7, 14, 28 day sliding window of
query log to compute P(q)
How long is enough?
Hard Baseline
(with optimal sliding window)
Only short-term data is available - so we are unable to use the [Shokouhi 2012] time-series modelling baseline.
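The sliding-window variant can be sketched by expiring evidence older than the window (timestamps and window handling here are illustrative assumptions):

```python
from collections import Counter, deque

class MleW:
    """MLE-W sketch: as MLE-ALL, but P(q) is estimated only from queries
    observed within a sliding time window of `window_days`."""
    def __init__(self, window_days=7):
        self.window = window_days * 86400  # window size in seconds
        self.log = deque()                 # (timestamp, query) in arrival order
        self.counts = Counter()

    def observe(self, query, ts):
        self.log.append((ts, query))
        self.counts[query] += 1
        # Expire evidence that has fallen out of the window.
        while self.log and self.log[0][0] < ts - self.window:
            _, old = self.log.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def suggest(self, prefix, k=4):
        matches = sorted(((q, c) for q, c in self.counts.items()
                          if q.startswith(prefix)), key=lambda x: -x[1])
        return [q for q, _ in matches[:k]]
```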
Last N Queries (LNQ)
Prefix popularity is not uniform (long-tail distribution)
Imposing a strict last-N-day window gives…
… too much evidence for popular prefixes
… too little evidence for unpopular prefixes
Use the last N queries observed with the prefix (+ flood control).
How many past queries? We experiment with N = 100, 200, 400, 800, 1200.
A very light-weight, practical approach - implemented using a double-ended queue (‘deque’) data structure.
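A minimal LNQ sketch (prefix length and flood control simplified; an illustration of the idea, not the authors' code):

```python
from collections import Counter, deque, defaultdict

class Lnq:
    """LNQ sketch: keep only the last N queries observed *per prefix*,
    so popular prefixes expire evidence quickly while rare prefixes
    retain evidence for longer."""
    def __init__(self, n=200, prefix_len=2):
        self.n = n
        self.prefix_len = prefix_len
        self.recent = defaultdict(deque)  # prefix -> last N queries

    def observe(self, query):
        dq = self.recent[query[:self.prefix_len]]
        dq.append(query)
        if len(dq) > self.n:
            dq.popleft()  # oldest query falls out of the evidence window

    def suggest(self, prefix, k=4):
        counts = Counter(q for q in self.recent[prefix[:self.prefix_len]]
                         if q.startswith(prefix))
        return [q for q, _ in counts.most_common(k)]
```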
Predicted Next N Queries (PNQ)
Don’t rely passively on past distribution -
predict short-range query popularity
- i.e. query occurrence in next 200 queries
Use only recent trends
Where ‘recent’ is based on the prefix popularity
Avoiding over-fitting and lag
Model Variables: Multiple (M) LNQs to track recent query popularity in ‘windows’
Predictive model: per-query counts across windows, e.g. q1 = {1,1,2,1}, q2 = {4,8,10,16} …
4x200 (4x LNQ, where N = 200)
Model training: stochastic gradient descent (SGD - fast!) to incrementally fit linear regression model parameters online, following every 200 queries observed with the prefix.
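The regression step can be sketched as follows (the model form and learning rate are illustrative assumptions, not the authors' exact configuration):

```python
class PnqModel:
    """PNQ sketch: map a query's counts in the last M LNQ windows to its
    predicted count in the next window, fitted online by SGD."""
    def __init__(self, m=4, lr=1e-3):
        self.m = m            # number of LNQ windows used as features
        self.lr = lr          # SGD learning rate (assumed value)
        self.w = [0.0] * m    # one weight per window count
        self.b = 0.0          # intercept

    def predict(self, history):
        # history: the query's counts in the last M windows, oldest first.
        return self.b + sum(wi * xi for wi, xi in zip(self.w, history))

    def sgd_step(self, history, actual_next):
        # One stochastic gradient step on squared prediction error,
        # taken after each completed window of observed queries.
        err = self.predict(history) - actual_next
        for i in range(self.m):
            self.w[i] -= self.lr * err * history[i]
        self.b -= self.lr * err
```

Completions for a prefix would then be ranked by predicted next-window count rather than raw past popularity.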
MLE-W (sliding window) / LNQ (Last N Queries) / PNQ (Predicted Next N Queries) approaches all have a parameter
We can train/test a single parameter on average over a collection.
Or, allow the parameter for different prefixes to change over time.
Online Learning (O-…)
A ‘meta-approach’
Feedback framework to select the optimal parameter online: rank completion suggestions using the best-performing parameter (by MRR) over the last Δ queries for each prefix.
Δ = 100, 300, 600 queries.
QAC provides instant feedback… we know the correct completion suggestion soon after our ranking.
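The meta-approach can be sketched as a selector over per-parameter rankers (the `suggest`/`observe` ranker interface is an assumption for illustration):

```python
from collections import deque

class OnlineSelector:
    """Online-learning sketch: run one ranker per candidate parameter,
    score each by reciprocal rank over the last delta queries, and serve
    suggestions from the currently best-scoring parameter."""
    def __init__(self, rankers, delta=300):
        self.rankers = rankers  # parameter value -> ranker object
        self.rr = {p: deque(maxlen=delta) for p in rankers}

    def _score(self, p):
        # Mean reciprocal rank over the feedback horizon (0 if none yet).
        return sum(self.rr[p]) / len(self.rr[p]) if self.rr[p] else 0.0

    def suggest(self, prefix, k=4):
        best = max(self.rankers, key=self._score)
        return self.rankers[best].suggest(prefix, k)

    def feedback(self, prefix, true_query, k=4):
        # The typed query is known shortly after ranking: score every
        # candidate parameter's ranking, then let each ranker learn.
        for p, r in self.rankers.items():
            ranked = r.suggest(prefix, k)
            rr = 1.0 / (ranked.index(true_query) + 1) if true_query in ranked else 0.0
            self.rr[p].append(rr)
            r.observe(true_query)
```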
Real-time QAC Simulation
Simulate a real-time scenario for experiments: step through real queries from a real query log, providing 4 completion suggestions.
Using ground-truth real query logs (AOL/MSN/Sogou); extracted ‘typed’ queries from the logs.
Measure QAC effectiveness using MRR - insensitive over millions of queries, so we need to consider even small changes.
Real user behaviour: QAC could modify user behaviour (possible bias); literal vs. semantic matching (‘AA’ or ‘American Airlines’).
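MRR, the effectiveness measure used throughout, can be computed as:

```python
def reciprocal_rank(suggestions, true_query):
    # 1/rank of the query the user actually typed, or 0 if not suggested.
    for rank, q in enumerate(suggestions, start=1):
        if q == true_query:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(trials):
    # trials: iterable of (ranked suggestion list, actually typed query).
    rrs = [reciprocal_rank(s, t) for s, t in trials]
    return sum(rrs) / len(rrs)
```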
Simulation Art (Thomas Briggs - salientimages.com)
The logs differ in language, query length and query volume.
Experimental Query Logs
                   AOL             MSN          Sogou
Language           English (US)    English (US) Chinese (Simplified)
Period             March-May 2006  May 2006     June 2008
Typed Queries      18.1M           11.9M        25.1M
Avg. per Hour      11.8K           29K          65K
Avg. per Day       196K            383K         837K
Avg. Query Length  17 chars        17.4 chars   6.5 chars
Std. dev.          11              12           4
3 diverse open query logs for experiments - time is language independent.
Cold-start ‘learning’ period: first 14 days of MSN/Sogou, first 28 days of AOL.
After these initial learning periods, we compare performance over the same periods.
MLE-ALL (soft baseline)
MLE-W (hard baseline)
LNQ
PNQ
O-… (online learning)
Compare all approaches overall
Other than MLE-ALL/W, results are reported as relative change (±%) to MLE-W (hard baseline).
Results
All results are statistically significant (millions of queries in each query log) - so we rely on effect size instead.
Baseline Approach: MLE-ALL
QAC performance varies considerably between systems.
Short prefixes do relatively poorly.
MSN > AOL at short prefixes.
Sogou far better (expectedly - Chinese vs. Latin alphabet).
AOL sometimes responds differently/with less effect (sampling/breadth issues) - see the paper for AOL results.
MLE-ALL MRR by prefix length (reconstructed chart data; series assignment inferred from the observations above):

Prefix length (chars)  2       3       4       5
MSN                    0.1124  0.1702  0.2106  0.2340
AOL                    0.0962  0.1527  0.1969  0.2304
Sogou                  0.4117  0.5487  0.5970  0.6129
Baseline Approach: MLE-W
[Figure: MRR change vs. sliding window size (2, 4, 7, 14 days), panels for MSN and Sogou, series for 2-5 character prefixes. Annotated points - MSN: +2.95%, +2.28%, +0.68%; Sogou: +3.59%, +3.36%, +3.03%.]
Using a sliding window has a considerable effect
Generally, for a small prefix, a small window is best.
We don't reach the optimal window for >2 character prefixes - but a pattern is emerging.
Using only recent evidence can improve QAC - but it can also harm.
A clear relationship is emerging between time and performance.
Approach: LNQ
[Figure: MRR change vs. last N queries (N = 100, 200, 400, 800, 1200), panels for MSN and Sogou, series for 2 and 3 character prefixes. Annotated points - MSN: +3.27%, +3.82%, +3.92% (2 chars), +1.92%, +2.16%, +2.05% (3 chars); Sogou: +4.58%, +4.88%, +4.65% (2 chars), +3.17%, +3.21%, +3.03% (3 chars).]
LNQ supports an optimal trade-off between recency and robustness
Practical technique leads to consistent QAC performance gains
Substantial gains for MSN and Sogou
Larger N (= 1200) is best for MSN; smaller N (= 200) for Sogou - and Sogou is much less sensitive to N.
Approach: PNQ
[Figure: MRR change vs. MxLNQ regression model configuration (10x50, 5x100, 20x50, 10x100, 5x200), panels for MSN and Sogou, series for 2 and 3 character prefixes. Annotated points - MSN: +3.86%, +3.95%, +4.06% (2 chars), +1.96%, +2.02%, +2.11% (3 chars); Sogou: +5.14%, +5.13%, +5.11% (2 chars), +3.38%, +3.36%, +3.35% (3 chars).]
Small improvement over optimal LNQ
Less sensitive to parameters compared to LNQ
Despite high predictive accuracy, PNQ barely improves QAC performance (discussed in paper)
Prediction overhead offers only marginal improvements over best LNQ parameters
Further work: more elaborate (non-linear) short range prediction approaches
Approach: Online Learning (O-…)
[Figure: MRR change vs. Δ (learn over past 100, 300, 600 queries) for O-MLE-W, O-LNQ and O-PNQ, 2 character prefixes only. Annotated points - MSN: +3.45%, +4.11%, +3.94%; Sogou: +3.83%, +5.42%, +5.28%.]
Performance varies for MSN across approaches and learning horizon.
Little sensitivity to delta in Sogou
O-LNQ performs optimally
Parameters are also time-sensitive
Online learning approach yields highest QAC effectiveness - for all query logs (inc. AOL)
Side-by-side: best performing approaches/parameters
All Approaches Compared
% MRR change over soft (non-temporal) baseline (MLE-ALL):

Prefix    MLE-W    LNQ      PNQ      O-LNQ (best online)
MSN
2 chars   +2.95%   +6.98%   +7.13%   +7.18%
3 chars   +0.27%   +2.44%   +2.39%   +2.71%
Sogou
2 chars   +3.59%   +8.64%   +8.91%   +9.21%
3 chars   -0.19%   +3.02%   +3.18%   +3.32%
Relatively small differences between approaches
Higher gains for 2 character prefix
Optimised simple LNQ works well in all cases
Online learnt O-LNQ provides state-of-the-art baseline
Research Conclusions
Recency is an important part of QAC - but so too is robustness.
These opposing objectives must be considered simultaneously
Proposed several approaches
Comprehensive experiments for 3 query logs
All approaches work (statistically significantly) - LNQ with online learning is most effective, with up to +9.2% MRR improvement.
Ongoing work to improve short-range query popularity prediction and incorporate context features
1. Several light-weight and practical approaches for QAC research + implementation
2. Optimised simple techniques work well!
3. Time in QAC is language-independent
4. No approach is best in all scenarios – choose carefully
Take-Aways
Approach implementation and simulation code (C#) is available on GitHub for further work
All results easily reproducible – using open query logs + code
@stewhir | http://www.stewh.com