1
Predictive Parallelization: Taming Tail Latencies in Web Search
Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
Microsoft Research, POSTECH, Rice University
2
Performance of Web Search
1) Query response time – answer users quickly (e.g., within 300 ms)
2) Response quality (relevance) – provide highly relevant web pages; quality improves with the resources and time spent
Focus: improving response time without compromising quality
3
Background: Query Processing Stages
[Pipeline: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, under a latency SLA of, for example, 300 ms]
• Doc. index search returns 100s–1000s of good matching docs
• 2nd-phase ranking keeps the 10s of best matching docs
• Snippet generator produces a few sentences for each doc
Focus: Stage 1 (doc. index search)
4
Goal
Speed up index search (stage 1) without compromising result quality
– Improve user experience
– Serve a larger index
– Support a more sophisticated 2nd phase
[Pipeline figure as on the previous slide: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, with a 300 ms latency SLA]
5
How Index Search Works
• Partition all web pages across index servers (massively parallel)
• Distribute query processing across the partitions (embarrassingly parallel)
• Aggregate the top-k relevant pages
[Figure: a query fans out from the aggregator to one index server per partition of all web pages; each server returns its top-k pages and the aggregator merges them]
Problem: a slow server makes the entire cluster slow
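To make the partition/aggregate step concrete, here is a minimal sketch, assuming a hypothetical per-server top_k(query, k) API and a thread pool for the fan-out; it is an illustration, not the production aggregator.

    from concurrent.futures import ThreadPoolExecutor
    import heapq

    def aggregate_top_k(servers, query, k=10):
        # Fan the query out to every index server (one per partition) in
        # parallel, then merge the per-partition results into the global top-k.
        with ThreadPoolExecutor(max_workers=len(servers)) as pool:
            partial = pool.map(lambda s: s.top_k(query, k), servers)  # hypothetical server API
            candidates = [hit for hits in partial for hit in hits]    # (score, doc_id) pairs
        # The comprehension above waits for every partition, so the aggregate
        # latency is set by the slowest index server.
        return heapq.nlargest(k, candidates)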
6
Observation
• The query is processed on every server, so response time is determined by the slowest one
• We need to reduce the tail latencies of the index servers
[Figure: aggregator and index-server timelines for a fast response and for a slow response, where a single slow index server delays the aggregated result]
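The effect is easy to reproduce with synthetic numbers: even if every index server has the same latency distribution, waiting for the slowest of many servers pushes the tail out sharply. The lognormal samples below are an assumption purely for illustration, not measured data.

    import numpy as np

    rng = np.random.default_rng(1)
    n_queries, n_servers = 100_000, 40          # made-up cluster size
    # Identical, independent per-server latencies (synthetic lognormal samples).
    latency = rng.lognormal(mean=2.0, sigma=1.0, size=(n_queries, n_servers))

    single = latency[:, 0]                      # latency seen at one server
    aggregate = latency.max(axis=1)             # response waits for the slowest server

    for name, lat in (("single server", single), ("aggregate", aggregate)):
        print(f"{name:14s} 99th percentile: {np.percentile(lat, 99):7.1f} ms")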
7
Examples
• Terminating a long query (an outlier) in the middle of processing → fast response, but a drop in quality
8
Parallelism for Tail Reduction
Opportunity
• Available idle cores
• CPU-intensive workloads
Challenge
• Tails are few
• Tails are very long

Latency breakdown at the 99th percentile:
  Network    4.26 ms
  Queueing   0.15 ms
  I/O        4.70 ms
  CPU        194.95 ms

Latency distribution:
  Percentile   Latency      Scale
  50th         7.83 ms      x1
  75th         12.51 ms     x1.6
  95th         57.15 ms     x7.3
  99th         204.06 ms    x26.1
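To show how such a table is derived from measured latencies, a minimal sketch; the sample data below are synthetic, not the workload behind the numbers above.

    import numpy as np

    def latency_table(latencies_ms, percentiles=(50, 75, 95, 99)):
        # Scale is relative to the median, as in the distribution table above.
        median = np.percentile(latencies_ms, 50)
        return [(p, np.percentile(latencies_ms, p),
                 np.percentile(latencies_ms, p) / median) for p in percentiles]

    # Illustrative usage with synthetic, heavy-tailed samples:
    samples = np.random.default_rng(0).lognormal(mean=2.0, sigma=1.2, size=100_000)
    for p, v, scale in latency_table(samples):
        print(f"{p}th  {v:8.2f} ms   x{scale:.1f}")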
10
Predictive Parallelism for Tail Reduction
• Short queries – many – almost no speedup
• Long queries – few – good speedup
[Charts: execution time and speedup vs. parallelism degree (1–6). Execution time of short queries (< 30 ms) drops only from about 5.2 ms to 4.5 ms, while execution time of long queries (> 80 ms) drops from about 169 ms to 41 ms]
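Reading the speedups off the charts, plus a toy cost model for why they differ so much; the fixed-overhead model is an assumption for illustration, not the paper's analysis.

    def speedup(sequential_ms, parallel_ms):
        return sequential_ms / parallel_ms

    # Values read from the charts above (high parallelism degree vs. sequential):
    print(f"short query: {speedup(5.2, 4.5):.1f}x")    # ~1.2x
    print(f"long  query: {speedup(169, 41):.1f}x")     # ~4.1x

    # Toy model (assumption): a fixed per-query parallelization cost means the
    # many short queries gain little while the few long queries gain a lot.
    def parallel_time(sequential_ms, degree, overhead_ms=3.0):
        return overhead_ms + sequential_ms / degree

    for seq in (5.2, 169.0):
        times = [round(parallel_time(seq, d), 1) for d in range(1, 7)]
        print(f"sequential {seq} ms -> {times}")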
11
Predictive Parallelization Workflow
[Diagram: query → execution-time predictor → index server]
Predict the (sequential) execution time of the query with high accuracy
12
Predictive Parallelization Workflow
[Diagram: query → execution-time predictor → resource manager → index servers; short queries run sequentially, long queries run in parallel]
Using the predicted time, selectively parallelize only long queries
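A minimal sketch of this dispatch decision, assuming a trained predictor and the 80 ms long-query cutoff from the slides; the class and function names are illustrative, not the system's actual interfaces.

    from dataclasses import dataclass

    LONG_QUERY_THRESHOLD_MS = 80.0        # long-query cutoff used in the slides

    @dataclass
    class ResourceManager:
        idle_cores: int
        def pick_parallelism(self, predicted_ms: float) -> int:
            # Illustrative policy: long queries get up to the available idle cores.
            return max(1, min(self.idle_cores, 6))

    def dispatch(query, predict_ms, run, manager: ResourceManager):
        # predict_ms(query) -> predicted sequential execution time in ms
        # run(query, degree) -> execute the query with the given parallelism degree
        predicted = predict_ms(query)
        if predicted > LONG_QUERY_THRESHOLD_MS:
            return run(query, manager.pick_parallelism(predicted))
        return run(query, 1)              # short queries stay sequential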
13
Predictive Parallelization
• Focus of today's talk
1. Predictor: identify long queries through machine learning
2. Parallelization: execute long queries in parallel with high efficiency
14
Brief Overview of Predictor
Accuracy requirement
• High recall, to guarantee the 99th-percentile reduction
• In our workload, 4% of queries take > 80 ms; at least 3% must be identified (75% recall)
Cost requirement
• Low prediction overhead and misprediction cost
• Prediction overhead of 0.75 ms or less, with high precision
Existing approaches: lower accuracy and higher cost
15
Accuracy: Predicting Early Termination
• Only a limited portion of the index contributes to the top-k relevant results
• How large that portion is depends on the keyword (more precisely, on its score distribution)
[Figure: inverted index for "SIGIR"; web documents Doc 1 … Doc N sorted by static rank from highest to lowest; only a prefix of the list is processed, the remainder is not evaluated]
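A minimal sketch of early termination over a posting list sorted by static rank; the scoring callable and the fixed evaluation budget are illustrative assumptions, not the engine's actual stopping rule.

    import heapq

    def top_k_early_termination(postings, score, k=10, budget=10_000):
        # postings: doc ids sorted by static rank, highest first (as in the figure)
        # score(doc_id): query-dependent relevance score (assumed callable)
        heap = []                                  # min-heap holding the current top-k
        for evaluated, doc in enumerate(postings, start=1):
            s = score(doc)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc))
            if evaluated >= budget:                # stop early: the low-static-rank
                break                              # tail rarely enters the top-k
        return sorted(heap, reverse=True)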
Space of Features
• Term features [Macdonald et al., SIGIR 12]
– IDF, NumPostings
– Score statistics (arithmetic, geometric, and harmonic means, max, variance, gradient)
• Query features
– NumTerms (before and after rewriting)
– Relaxed
– Language
New Features: Query
• Rich clues from queries in modern search engines
<Fields related to the query execution plan> rank=BM25F enablefresh=1 partialmatch=1 language=en location=us …
<Fields related to the search keywords> SIGIR (Queensland or QLD)
Space of Features
Category             Features
Term features (14)   AMeanScore, GMeanScore, HMeanScore, MaxScore, EMaxScore, VarScore, NumPostings, GAvgMaxima, MaxNumPostings, In5%MaxNum, ThresProK, IDF
Query features (6)   English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter
• All features cached to ensure responsiveness (avoiding disk access)
• Term features require 4.47GB memory footprint (for 100M terms)
20
Feature Analysis and Selection
• Per-feature accuracy gains from the boosted regression tree suggest a cheaper feature subset
[Chart: recall (roughly 0.60–0.85) vs. number of features, sorted by importance, compared against using all features]
22
Prediction Performance
• Query features are important
• Using cheap features is advantageous
– IDF from the keyword features, plus the query features
– Much smaller overhead (90+% less)
– Accuracy similar to using all features

80 ms threshold    Precision (|A∩P|/|P|)   Recall (|A∩P|/|A|)   Cost
Keyword features   0.76                    0.64                 High
All features       0.89                    0.84                 High
Cheap features     0.86                    0.80                 Low

A = actual long queries, P = predicted long queries
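The precision and recall definitions in the table are simple set ratios; a minimal sketch with made-up query ids:

    def precision_recall(actual_long, predicted_long):
        hit = actual_long & predicted_long                     # |A ∩ P|
        precision = len(hit) / len(predicted_long) if predicted_long else 0.0
        recall = len(hit) / len(actual_long) if actual_long else 0.0
        return precision, recall

    # Illustrative only: ids of queries whose true / predicted time exceeds 80 ms.
    A = {"q01", "q04", "q07", "q09"}
    P = {"q01", "q04", "q09", "q12"}
    print(precision_recall(A, P))                              # (0.75, 0.75)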
Algorithms
• Classification vs. regression
– Comparable accuracy
– Flexibility
• Regression algorithms considered (a training sketch follows this list)
– Linear regression
– Gaussian process regression
– Boosted regression tree
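A minimal sketch of training such an execution-time regressor on cheap per-query features, using scikit-learn's gradient-boosted trees as a stand-in for the boosted regression tree; the feature matrix and timing data are synthetic assumptions, not the paper's training set.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    # Columns stand in for the cheap features: max-term IDF plus the six query
    # features (English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter).
    X = rng.random((5_000, 7))
    y = 5 + 300 * X[:, 0] * X[:, 2] + rng.normal(0, 2, 5_000)   # synthetic exec. time (ms)

    model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    model.fit(X, y)

    predicted_ms = model.predict(X[:1])[0]
    print(f"predicted sequential execution time: {predicted_ms:.1f} ms")
    print("parallelize" if predicted_ms > 80 else "run sequentially")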
Accuracy of Algorithms
• Summary
– 80% of long queries (> 80 ms) identified
– 0.6% of short queries mispredicted
– 0.55 ms prediction time, with low memory overhead
Predictive Parallelism
• Key idea
– Parallelize only long queries
– Use a threshold on predicted execution time
• Evaluation
– Compare Predictive against baselines: Sequential, Fixed, Adaptive
26
99th-Percentile Response Time
• Predictive parallelization outperforms parallelizing all queries
[Chart: 99th-percentile response time (ms) vs. query arrival rate (QPS) for Sequential, Fixed (Degree=3), Adaptive, and Predictive]
• 50% throughput increase
29
Related Work
• Search query parallelism
– Fixed parallelization [Frachtenberg, WWWJ 09]
– Adaptive parallelization using system load only [Raman et al., PLDI 11]
→ High overhead due to parallelizing all queries
• Execution time prediction
– Keyword-specific features only [Macdonald et al., SIGIR 12]
→ Lower accuracy and high memory overhead for our target problem
Your query to Bing is now parallelized if predicted as long.
Thank You!