1
Predictive Parallelization: Taming Tail Latencies in Web Search
Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
Microsoft Research, POSTECH, Rice University
2
Performance of Web Search
1) Query response time – answer users quickly (e.g., within 300 ms)
2) Response quality (relevance) – provide highly relevant web pages; quality improves with the resources and time spent
Focus: improving response time without compromising quality
3
Background: Query Processing Stages
[Pipeline: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, under a latency SLA of, for example, 300 ms]
• Doc. index search returns 100s–1000s of good matching docs
• 2nd-phase ranking keeps the 10s of best matching docs
• Snippet generator produces a few sentences for each doc
Focus: Stage 1 (doc. index search)
4
Goal
Speed up index search (stage 1) without compromising result quality
– Improve user experience
– Serve a larger index
– Support a more sophisticated 2nd phase
[Pipeline figure as on the previous slide: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, with a 300 ms latency SLA]
5
How Index Search Works
• Partition all web pages across index servers (massively parallel)
• Distribute query processing across the partitions (embarrassingly parallel)
• Aggregate the top-k relevant pages
[Figure: a query fans out from the aggregator to one index server per partition of all web pages; each server returns its top-k pages and the aggregator merges them]
Problem: a slow server makes the entire cluster slow
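To make the partition/aggregate step concrete, here is a minimal sketch, assuming a hypothetical per-server top_k(query, k) API and a thread pool for the fan-out; it is an illustration, not the production aggregator.

    from concurrent.futures import ThreadPoolExecutor
    import heapq

    def aggregate_top_k(servers, query, k=10):
        # Fan the query out to every index server (one per partition) in
        # parallel, then merge the per-partition results into the global top-k.
        with ThreadPoolExecutor(max_workers=len(servers)) as pool:
            partial = pool.map(lambda s: s.top_k(query, k), servers)  # hypothetical server API
            candidates = [hit for hits in partial for hit in hits]    # (score, doc_id) pairs
        # The comprehension above waits for every partition, so the aggregate
        # latency is set by the slowest index server.
        return heapq.nlargest(k, candidates)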
6
Observation
• The query is processed on every server, so response time is determined by the slowest one
• We need to reduce the tail latencies of the index servers
[Figure: aggregator and index-server timelines for a fast response and for a slow response, where a single slow index server delays the aggregated result]
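The effect is easy to reproduce with synthetic numbers: even if every index server has the same latency distribution, waiting for the slowest of many servers pushes the tail out sharply. The lognormal samples below are an assumption purely for illustration, not measured data.

    import numpy as np

    rng = np.random.default_rng(1)
    n_queries, n_servers = 100_000, 40          # made-up cluster size
    # Identical, independent per-server latencies (synthetic lognormal samples).
    latency = rng.lognormal(mean=2.0, sigma=1.0, size=(n_queries, n_servers))

    single = latency[:, 0]                      # latency seen at one server
    aggregate = latency.max(axis=1)             # response waits for the slowest server

    for name, lat in (("single server", single), ("aggregate", aggregate)):
        print(f"{name:14s} 99th percentile: {np.percentile(lat, 99):7.1f} ms")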
7
Examples
• Terminating a long query (an outlier) in the middle of processing → fast response, but a drop in quality
8
Parallelism for Tail Reduction
Opportunity
• Available idle cores
• CPU-intensive workloads
Challenge
• Tails are few
• Tails are very long

Latency breakdown at the 99th percentile:
  Network    4.26 ms
  Queueing   0.15 ms
  I/O        4.70 ms
  CPU        194.95 ms

Latency distribution:
  Percentile   Latency      Scale
  50th         7.83 ms      x1
  75th         12.51 ms     x1.6
  95th         57.15 ms     x7.3
  99th         204.06 ms    x26.1
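To show how such a table is derived from measured latencies, a minimal sketch; the sample data below are synthetic, not the workload behind the numbers above.

    import numpy as np

    def latency_table(latencies_ms, percentiles=(50, 75, 95, 99)):
        # Scale is relative to the median, as in the distribution table above.
        median = np.percentile(latencies_ms, 50)
        return [(p, np.percentile(latencies_ms, p),
                 np.percentile(latencies_ms, p) / median) for p in percentiles]

    # Illustrative usage with synthetic, heavy-tailed samples:
    samples = np.random.default_rng(0).lognormal(mean=2.0, sigma=1.2, size=100_000)
    for p, v, scale in latency_table(samples):
        print(f"{p}th  {v:8.2f} ms   x{scale:.1f}")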
10
Predictive Parallelism for Tail Reduction
• Short queries – many – almost no speedup
• Long queries – few – good speedup
[Charts: execution time and speedup vs. parallelism degree (1–6). Execution time of short queries (< 30 ms) drops only from about 5.2 ms to 4.5 ms, while execution time of long queries (> 80 ms) drops from about 169 ms to 41 ms]
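Reading the speedups off the charts, plus a toy cost model for why they differ so much; the fixed-overhead model is an assumption for illustration, not the paper's analysis.

    def speedup(sequential_ms, parallel_ms):
        return sequential_ms / parallel_ms

    # Values read from the charts above (high parallelism degree vs. sequential):
    print(f"short query: {speedup(5.2, 4.5):.1f}x")    # ~1.2x
    print(f"long  query: {speedup(169, 41):.1f}x")     # ~4.1x

    # Toy model (assumption): a fixed per-query parallelization cost means the
    # many short queries gain little while the few long queries gain a lot.
    def parallel_time(sequential_ms, degree, overhead_ms=3.0):
        return overhead_ms + sequential_ms / degree

    for seq in (5.2, 169.0):
        times = [round(parallel_time(seq, d), 1) for d in range(1, 7)]
        print(f"sequential {seq} ms -> {times}")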
11
Predictive Parallelization Workflow
[Diagram: query → execution-time predictor → index server]
Predict the (sequential) execution time of the query with high accuracy
12
Predictive Parallelization Workflow
[Diagram: query → execution-time predictor → resource manager → index servers; short queries run sequentially, long queries run in parallel]
Using the predicted time, selectively parallelize only long queries
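A minimal sketch of this dispatch decision, assuming a trained predictor and the 80 ms long-query cutoff from the slides; the class and function names are illustrative, not the system's actual interfaces.

    from dataclasses import dataclass

    LONG_QUERY_THRESHOLD_MS = 80.0        # long-query cutoff used in the slides

    @dataclass
    class ResourceManager:
        idle_cores: int
        def pick_parallelism(self, predicted_ms: float) -> int:
            # Illustrative policy: long queries get up to the available idle cores.
            return max(1, min(self.idle_cores, 6))

    def dispatch(query, predict_ms, run, manager: ResourceManager):
        # predict_ms(query) -> predicted sequential execution time in ms
        # run(query, degree) -> execute the query with the given parallelism degree
        predicted = predict_ms(query)
        if predicted > LONG_QUERY_THRESHOLD_MS:
            return run(query, manager.pick_parallelism(predicted))
        return run(query, 1)              # short queries stay sequential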
13
Predictive Parallelization
• Focus of today's talk
1. Predictor: identify long queries through machine learning
2. Parallelization: execute long queries in parallel with high efficiency
14
Brief Overview of Predictor
Accuracy requirement
• High recall, to guarantee the 99th-percentile reduction
• In our workload, 4% of queries take > 80 ms; at least 3% must be identified (75% recall)
Cost requirement
• Low prediction overhead and misprediction cost
• Prediction overhead of 0.75 ms or less, with high precision
Existing approaches: lower accuracy and higher cost
15
Accuracy: Predicting Early Termination
• Only a limited portion of the index contributes to the top-k relevant results
• How large that portion is depends on the keyword (more precisely, on its score distribution)
[Figure: inverted index for "SIGIR"; web documents Doc 1 … Doc N sorted by static rank from highest to lowest; only a prefix of the list is processed, the remainder is not evaluated]
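A minimal sketch of early termination over a posting list sorted by static rank; the scoring callable and the fixed evaluation budget are illustrative assumptions, not the engine's actual stopping rule.

    import heapq

    def top_k_early_termination(postings, score, k=10, budget=10_000):
        # postings: doc ids sorted by static rank, highest first (as in the figure)
        # score(doc_id): query-dependent relevance score (assumed callable)
        heap = []                                  # min-heap holding the current top-k
        for evaluated, doc in enumerate(postings, start=1):
            s = score(doc)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc))
            if evaluated >= budget:                # stop early: the low-static-rank
                break                              # tail rarely enters the top-k
        return sorted(heap, reverse=True)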
Space of Features
• Term features [Macdonald et al., SIGIR 12]
– IDF, NumPostings
– Score statistics (arithmetic, geometric, and harmonic means, max, variance, gradient)
• Query features
– NumTerms (before and after rewriting)
– Relaxed
– Language
New Features: Query
• Rich clues from queries in modern search engines
<Fields related to the query execution plan> rank=BM25F enablefresh=1 partialmatch=1 language=en location=us …
<Fields related to the search keywords> SIGIR (Queensland or QLD)
Space of Features
Category             Features
Term features (14)   AMeanScore, GMeanScore, HMeanScore, MaxScore, EMaxScore, VarScore, NumPostings, GAvgMaxima, MaxNumPostings, In5%MaxNum, ThresProK, IDF
Query features (6)   English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter
• All features cached to ensure responsiveness (avoiding disk access)
• Term features require 4.47GB memory footprint (for 100M terms)
20
Feature Analysis and Selection
• Per-feature accuracy gains from the boosted regression tree suggest a cheaper feature subset
[Chart: recall (roughly 0.60–0.85) vs. number of features, sorted by importance, compared against using all features]
22
Prediction Performance
• Query features are important
• Using cheap features is advantageous
– IDF from the keyword features, plus the query features
– Much smaller overhead (90+% less)
– Accuracy similar to using all features

80 ms threshold    Precision (|A∩P|/|P|)   Recall (|A∩P|/|A|)   Cost
Keyword features   0.76                    0.64                 High
All features       0.89                    0.84                 High
Cheap features     0.86                    0.80                 Low

A = actual long queries, P = predicted long queries
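The precision and recall definitions in the table are simple set ratios; a minimal sketch with made-up query ids:

    def precision_recall(actual_long, predicted_long):
        hit = actual_long & predicted_long                     # |A ∩ P|
        precision = len(hit) / len(predicted_long) if predicted_long else 0.0
        recall = len(hit) / len(actual_long) if actual_long else 0.0
        return precision, recall

    # Illustrative only: ids of queries whose true / predicted time exceeds 80 ms.
    A = {"q01", "q04", "q07", "q09"}
    P = {"q01", "q04", "q09", "q12"}
    print(precision_recall(A, P))                              # (0.75, 0.75)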
Algorithms
• Classification vs. regression
– Comparable accuracy
– Flexibility
• Regression algorithms considered (a training sketch follows this list)
– Linear regression
– Gaussian process regression
– Boosted regression tree
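A minimal sketch of training such an execution-time regressor on cheap per-query features, using scikit-learn's gradient-boosted trees as a stand-in for the boosted regression tree; the feature matrix and timing data are synthetic assumptions, not the paper's training set.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    # Columns stand in for the cheap features: max-term IDF plus the six query
    # features (English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter).
    X = rng.random((5_000, 7))
    y = 5 + 300 * X[:, 0] * X[:, 2] + rng.normal(0, 2, 5_000)   # synthetic exec. time (ms)

    model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    model.fit(X, y)

    predicted_ms = model.predict(X[:1])[0]
    print(f"predicted sequential execution time: {predicted_ms:.1f} ms")
    print("parallelize" if predicted_ms > 80 else "run sequentially")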
Accuracy of Algorithms
• Summary
– 80% of long queries (> 80 ms) identified
– 0.6% of short queries mispredicted
– 0.55 ms prediction time, with low memory overhead
Predictive Parallelism
• Key idea
– Parallelize only long queries
– Use a threshold on predicted execution time
• Evaluation
– Compare Predictive against baselines: Sequential, Fixed, Adaptive
26
99th-Percentile Response Time
• Predictive parallelization outperforms parallelizing all queries
[Chart: 99th-percentile response time (ms) vs. query arrival rate (QPS) for Sequential, Fixed (Degree=3), Adaptive, and Predictive]
• 50% throughput increase
29
Related Work
• Search query parallelism
– Fixed parallelization [Frachtenberg, WWWJ 09]
– Adaptive parallelization using system load only [Raman et al., PLDI 11]
→ High overhead due to parallelizing all queries
• Execution time prediction
– Keyword-specific features only [Macdonald et al., SIGIR 12]
→ Lower accuracy and high memory overhead for our target problem
Your query to Bing is now parallelized if predicted as long.
Thank You!