answering why-not questions on top-k queries

Answering Why-not Questions on Top-K Queries Andy He and Eric Lo The Hong Kong Polytechnic University

Upload: andy-he

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Answering Why-Not Questions on Top-K Queries

Answering Why-Not Questions on Top-K Queries
Andy He and Eric Lo
The Hong Kong Polytechnic University

Page 2: Answering Why-Not Questions on Top-K Queries

Background
- The database community has focused on performance issues for decades.
- Recently, more people have turned their attention to usability issues:
  - Supporting keyword search
  - Query auto-completion
  - Explaining query results (a.k.a. Why and Why-Not questions)

Page 3: Answering Why-Not Questions on Top-K Queries

Why-Not Questions
- You pose a query Q.
- The database returns a result R.
- R surprises you.
  - E.g., a tuple m that you expected in the result is missing, and you ask “WHY??!”
- You pose a why-not question (Q, R, m).
- The database returns an explanation E.

Page 4: Answering Why-Not Questions on Top-K Queries

The (short) history of Why-Not
- Chapman and Jagadish, “Why Not?” [SIGMOD 09]
  - Select-Project-Join (SPJ) queries
  - Explanation E = “tell you which operator excludes the expected tuple”
- Huang, Chen, A.H. Doan, and J. Naughton, “On the Provenance of Non-Answers to Queries Over Extracted Data” [PVLDB 09]
  - SPJ queries
  - Explanation E = “tell you how to modify the data”

Page 5: Answering Why-Not Questions on Top-K Queries

The (short) history of Why-Not
- Herschel and Hernández, “Explaining Missing Answers to SPJUA Queries” [PVLDB 10]
  - SPJUA queries
  - Explanation E = “tell you how to modify the data”
- Tran and C.Y. Chan, “How to ConQueR Why-Not Questions” [SIGMOD 10]
  - SPJA queries
  - Explanation E = “tell you how to modify your query”

Page 6: Answering Why-Not Questions on Top-K Queries

About this work
- Why-not questions on top-k queries.
- Hotel <Price, Distance to City Center>
- Top-3 hotels, weighting worigin = <0.5, 0.5>
- Result:
  - Rank 1: Sheraton
  - Rank 2: Westin
  - Rank 3: InterContinental
- “WHY is my favorite, Renaissance, NOT in the top-3 result?”
  - Is my value of k too small?
  - Or should I revise my weighting?
  - Or do I need to modify both k and the weighting?
- Explanation E = “tell you how to refine your top-k query in order to get your favorites back into the result”
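A top-k query of this kind can be sketched as a weighted sum over the attributes followed by a sort. The sketch below is illustrative only. The slides give no attribute values, so the numbers are hypothetical, chosen so that the top-3 under <0.5, 0.5> matches the result above. It also assumes both attributes are already normalized so that a higher weighted score is better. The names `HOTELS`, `score`, and `top_k` are made up for this example.

```python
# Hypothetical, pre-normalized attribute values (price score, distance score).
# Higher is better. These numbers are NOT from the slides.
HOTELS = {
    "Sheraton": (5, 15),
    "Westin": (14, 4),
    "InterContinental": (3, 13),
    "Hilton": (12, 2),
    "Renaissance": (1, 11),
}

def score(attrs, w):
    """Weighted sum of a tuple's attributes under weighting w."""
    return sum(wi * ai for wi, ai in zip(w, attrs))

def top_k(data, w, k):
    """Return the names of the k highest-scoring tuples under weighting w."""
    ranked = sorted(data, key=lambda name: score(data[name], w), reverse=True)
    return ranked[:k]

print(top_k(HOTELS, (0.5, 0.5), 3))
```

With these made-up values, Renaissance is absent from the top-3, which is the situation that triggers the why-not question.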

Page 7: Answering Why-Not Questions on Top-K Queries

One possible answer: modify only k
- Original query: Q(k=3, worigin=<0.5,0.5>)
- The ranking of Renaissance under the original weighting worigin=<0.5,0.5>:
  - Rank 1: Sheraton
  - Rank 2: Westin
  - Rank 3: InterContinental
  - Rank 4: Hilton
  - Rank 5: Renaissance
- Refined query #1: Q1(k=5, w=<0.5,0.5>)

Page 8: Answering Why-Not Questions on Top-K Queries

Another possible answer: modify only the weighting
- Original query: Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting to w=<0.1,0.9>:
  - Rank 1: Hotel E
  - Rank 2: Hotel F
  - Rank 3: Renaissance
- Refined query #2: Q2(k=3, w=<0.1,0.9>)

Page 9: Answering Why-Not Questions on Top-K Queries

Yet another possible answer: modify both
- Original query: Q(k=3, w=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- Refined query #2: Q2(k=3, w=<0.1,0.9>)
- If we set the weighting to w=<0.9,0.1>:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query #3: Q3(k=10000, w=<0.9,0.1>)

Page 10: Answering Why-Not Questions on Top-K Queries

Our objective
- Find the refined query that minimizes a penalty function while including the missing tuple m in the top-k result.
- Penalty preferences:
  - Prefer Modify K (PMK)
  - Prefer Modify Weighting (PMW)
  - Never Mind (NM, the default)

Page 11: Answering Why-Not Questions on Top-K Queries

Basic idea
- For each weighting wi ∈ W:
  - Run PROGRESS(wi, UNTIL-SEE-m)
  - Obtain the ranking ri of m under the weighting wi
  - Form a refined query Qi(k=ri, w=wi)
- Return the refined query with the least penalty
- Problem: W is infinite!
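The loop above can be sketched for a small, finite candidate set standing in for the infinite W. Everything here is an assumption made for illustration: the hotel values are hypothetical, `progress_until_see` sorts the whole data set rather than fetching tuples progressively as a real PROGRESS operator would, and the penalty uses equal weights λk = λw = 0.5 with ∆k taken as max(0, k - k0).

```python
import math

# Hypothetical normalized attribute values (higher is better); not from the slides.
HOTELS = {
    "Sheraton": (5, 15),
    "Westin": (14, 4),
    "InterContinental": (3, 13),
    "Hilton": (12, 2),
    "Renaissance": (1, 11),
}

def progress_until_see(data, w, m):
    """Visit tuples in descending score order under w and return m's rank."""
    ranked = sorted(
        data,
        key=lambda n: sum(wi * ai for wi, ai in zip(w, data[n])),
        reverse=True,
    )
    for rank, name in enumerate(ranked, start=1):
        if name == m:
            return rank
    raise ValueError(f"{m!r} is not in the data set")

def penalty(k, w, k0, w0, lam_k=0.5, lam_w=0.5):
    """Basic penalty lam_k * delta_k + lam_w * delta_w (lambda values assumed)."""
    delta_k = max(0, k - k0)      # assumption: shrinking k costs nothing
    delta_w = math.dist(w, w0)    # Euclidean distance between weightings
    return lam_k * delta_k + lam_w * delta_w

def best_refined_query(data, m, k0, w0, candidates):
    """Try each candidate weighting; keep the refined query with least penalty."""
    best = None
    for w in candidates:
        r = progress_until_see(data, w, m)   # rank of m under w
        p = penalty(r, w, k0, w0)            # penalty of Qi(k=r, w=w)
        if best is None or p < best[0]:
            best = (p, r, w)
    return best

W = [(0.5, 0.5), (0.1, 0.9), (0.9, 0.1)]     # finite stand-in for W
p, k, w = best_refined_query(HOTELS, "Renaissance", 3, (0.5, 0.5), W)
print(f"refined query: k={k}, w={w}, penalty={p:.3f}")
```

The cost is one full PROGRESS run per candidate weighting.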

Page 12: Answering Why-Not Questions on Top-K Queries

Our approach: sampling
- For each weighting wi ∈ W:
  - Run PROGRESS(wi, UNTIL-SEE-m)
  - Obtain the ranking ri of m under the weighting wi
  - Form a refined query Qi(k=ri, w=wi)
- Return the refined query with the least penalty
- W is now a set of weightings drawn from a restricted weighting space.
- Key theorem: the optimal refined query Qbest is either Q1, or it has a weighting wbest in a restricted weighting space.

Page 13: Answering Why-Not Questions on Top-K Queries

How large should the sample size be?
- We say a refined query is a best-T% refined query if its penalty is smaller than that of (100 - T)% of all refined queries.
- We want to obtain such a query with a probability larger than a threshold Pr.
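The slide does not spell out the resulting formula. One standard way to derive it, assuming the samples are drawn independently and each falls among the best T% with probability T%, is to note that all s samples miss with probability (1 - T%)^s. Requiring 1 - (1 - T%)^s ≥ Pr gives s ≥ log(1 - Pr) / log(1 - T%). The function name `sample_size` and the parameter values below are illustrative only.

```python
import math

def sample_size(t, pr):
    """Smallest s with 1 - (1 - t)**s >= pr, for 0 < t < 1 and 0 < pr < 1.

    t is T% written as a fraction (0.005 means best-0.5%).
    """
    if not (0 < t < 1 and 0 < pr < 1):
        raise ValueError("t and pr must be strictly between 0 and 1")
    return math.ceil(math.log(1 - pr) / math.log(1 - t))

# Illustrative setting: a best-0.5% refined query with probability at least 0.8.
s = sample_size(t=0.005, pr=0.8)
print(s)
```

The sample size depends only on T% and Pr, not on the size of the data set.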

Page 14: Answering Why-Not Questions on Top-K Queries

The PROGRESS operation can be expensive
- Original query: Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting to w=<0.9,0.1>:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query: Q2(k=10000, w=<0.9,0.1>)
- Very slow!

Page 15: Answering Why-Not Questions on Top-K Queries

Two optimization techniques
- Stop each PROGRESS operation early
- Skip some PROGRESS operations

Page 16: Answering Why-Not Questions on Top-K Queries

Stop earlier
- The original query: Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting to w=<0.9,0.1>:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 5: Hotel D
  - …
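One way to read this slide, assuming the basic penalty model λk ∆k + λw ∆w with λk = λw = 0.5 (the slide does not state these values): once the penalty implied by the current rank alone reaches the best penalty found so far, m can no longer yield a better refined query, so the scan can stop. The function name, the ranking list, and the stopping test below are a sketch under that assumption, not the authors' exact rule.

```python
import math

def rank_with_early_stop(ranking, m, k0, delta_w, best_penalty,
                         lam_k=0.5, lam_w=0.5):
    """Scan `ranking` (best first) for m, giving up once m can no longer win.

    Returns (rank, scanned); rank is None if the scan was cut short.
    """
    scanned = 0
    for r, name in enumerate(ranking, start=1):
        scanned = r
        # Penalty of a refined query with k = r: a lower bound for any k >= r.
        if lam_k * max(0, r - k0) + lam_w * delta_w >= best_penalty:
            return None, scanned
        if name == m:
            return r, scanned
    return None, scanned

w0, w = (0.5, 0.5), (0.9, 0.1)
best = 0.5 * 2 + 0.5 * 0   # penalty of Q1(k=5, w=<0.5,0.5>) under the assumed lambdas
# Hypothetical ranking under w, with Renaissance at rank 10000.
ranking = [f"Hotel {i}" for i in range(1, 10000)] + ["Renaissance"]
rank, scanned = rank_with_early_stop(ranking, "Renaissance", 3,
                                     math.dist(w, w0), best)
print(rank, scanned)
```

Under these assumptions the scan stops at rank 5 instead of continuing to rank 10000.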

Page 17: Answering Why-Not Questions on Top-K Queries

Skip PROGRESS operation (a)
- Similar weightings may lead to similar rankings (based on the “Reverse Top-K” paper, ICDE’10).
- Therefore, the query result of PROGRESS(wx, UNTIL-SEE-m) could be used to deduce the query result of PROGRESS(wy, UNTIL-SEE-m), provided that wx and wy are similar.

Page 18: Answering Why-Not Questions on Top-K Queries

Skip PROGRESS operation (a)
- E.g., original query: Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- What do the scores look like if we set w=<0.6,0.4>?

Score under w=<0.5,0.5>
Hotel             Score
Sheraton          10
Westin            9
InterContinental  8
Hilton            7
Renaissance       6

Score under w=<0.6,0.4>
Hotel             Score
Sheraton          9
Westin            10
InterContinental  7
Hilton            8
Renaissance       5

Page 19: Answering Why-Not Questions on Top-K Queries

Skip PROGRESS operation (b)
- We can skip a weighting w if its change ∆w from the original weighting worigin is too large.
- E.g., if we already have a refined query with a penalty of 0.5 and a weighting w has a change ∆w of 1, we can skip w entirely.
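The skip test can be sketched as a lower-bound check. Assuming the basic penalty λk ∆k + λw ∆w with ∆k ≥ 0, the term λw ∆w alone is a lower bound on the penalty of any refined query that uses w. The value λw = 0.5 is an assumption (the slide does not give it), and `can_skip` is a hypothetical helper name.

```python
def can_skip(delta_w, best_penalty, lam_w=0.5):
    """True if changing the weighting alone already costs at least best_penalty.

    Such a weighting cannot produce a strictly better refined query,
    so its PROGRESS run can be skipped without looking at the data.
    """
    return lam_w * delta_w >= best_penalty

print(can_skip(delta_w=1.0, best_penalty=0.5))   # the slide's example
print(can_skip(delta_w=0.2, best_penalty=0.5))   # small change: still worth running
```

The check needs only the weighting itself, so it costs nothing compared with a PROGRESS run.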

Page 20: Answering Why-Not Questions on Top-K Queries

Experiments
- Case study on NBA data
- Experiments on synthetic data

Page 21: Answering Why-Not Questions on Top-K Queries

Case study on NBA data
- Compare with a pure random sampling version, which draws samples not from the restricted weighting space but from the complete weighting space.

Page 22: Answering Why-Not Questions on Top-K Queries

Find the top-3 centers in NBA history
- 5 attributes (weighting = 1/5 each): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW
- Initial result:
  - Rank 1: Chamberlain
  - Rank 2: Abdul-Jabbar
  - Rank 3: O’Neal

Page 23: Answering Why-Not Questions on Top-K Queries

Find the top-3 centers in NBA history

               Sampling on the restricted   Sampling on the whole
               weighting space              weighting space
Refined query  Top-3                        Top-7
∆k             0                            4
Time (ms)      156                          154
Penalty        0.069                        0.28

- Why not?!
- We choose “Prefer Modify Weighting”

Page 24: Answering Why-Not Questions on Top-K Queries

Synthetic data
- Uniform, anti-correlated, correlated
- Scalability

Page 25: Answering Why-Not Questions on Top-K Queries

Varying query dimensions


Page 26: Answering Why-Not Questions on Top-K Queries

Varying ko


Page 27: Answering Why-Not Questions on Top-K Queries

Varying the ranking of the missing object


Page 28: Answering Why-Not Questions on Top-K Queries

Varying the number of missing objects


Page 29: Answering Why-Not Questions on Top-K Queries

Varying T%

(Plot panels: Time, Quality)

Page 30: Answering Why-Not Questions on Top-K Queries

Varying Pr


Page 31: Answering Why-Not Questions on Top-K Queries

Optimization effectiveness


Page 32: Answering Why-Not Questions on Top-K Queries

Conclusions
- We are the first to answer why-not questions on top-k queries.
- We prove that finding the optimal answer is computationally expensive.
- A sampling-based method is proposed.
  - The optimal answer is proved to be in a restricted sample space.
- Two optimization techniques are proposed:
  - Stop each PROGRESS operation early
  - Skip some PROGRESS operations

Page 33: Answering Why-Not Questions on Top-K Queries

Thanks
Q&A

Page 34: Answering Why-Not Questions on Top-K Queries

Deal with multiple missing objects M
- We have to modify the algorithm a little bit:
  - Do a simple filtering on the set of missing objects:
    - If mi dominates mj in the data space, remove mi from M, because every time mj shows up in a top-k result, mi must be there too.
  - The condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M.
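The filtering step can be sketched as a pairwise dominance check. This assumes non-negative weights and attributes where higher is better, so a dominating object never scores below the object it dominates. The object names and values are hypothetical, and `dominates` and `filter_missing` are illustrative helpers.

```python
def dominates(a, b):
    """a dominates b: at least as good on every attribute, strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def filter_missing(missing):
    """Drop every mi that dominates some other mj in `missing`.

    If mj appears in a top-k result, any mi dominating it appears too,
    so only the dominated objects need to be tracked.
    """
    return {
        name: attrs
        for name, attrs in missing.items()
        if not any(dominates(attrs, other)
                   for other_name, other in missing.items()
                   if other_name != name)
    }

# Hypothetical missing objects with made-up attribute values.
M = {"m1": (5, 5), "m2": (3, 4), "m3": (6, 1)}
print(sorted(filter_missing(M)))
```

Here m1 dominates m2 and is dropped, while m2 and m3 are incomparable and both stay in M.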

Page 35: Answering Why-Not Questions on Top-K Queries

Penalty Model
- Original query: Q(3, worigin)
- Refined query: Q1(5, worigin)
- Penalty of changing k: ∆k = 5 - 3 = 2
- Penalty of changing w: ∆w = ||worigin - worigin||2 = 0
- Basic penalty model: Penalty(5, worigin) = λk ∆k + λw ∆w, where λk + λw = 1
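As a worked example, the basic penalty model can be applied to the three refined queries Q1, Q2, and Q3 from the earlier slides. The equal weights λk = λw = 0.5 are an assumption, as is treating ∆k as max(0, k - k0). This is the basic model only, not the normalized penalty function.

```python
import math

def penalty(k, w, k0, w0, lam_k, lam_w):
    """Basic penalty model: lam_k * delta_k + lam_w * delta_w, lam_k + lam_w = 1."""
    assert math.isclose(lam_k + lam_w, 1.0)
    delta_k = max(0, k - k0)      # assumption: only increases of k are penalized
    delta_w = math.dist(w, w0)    # L2 distance between the two weightings
    return lam_k * delta_k + lam_w * delta_w

k0, w0 = 3, (0.5, 0.5)
refined = {
    "Q1": (5, (0.5, 0.5)),        # modify only k
    "Q2": (3, (0.1, 0.9)),        # modify only the weighting
    "Q3": (10000, (0.9, 0.1)),    # modify both
}
penalties = {name: penalty(k, w, k0, w0, lam_k=0.5, lam_w=0.5)
             for name, (k, w) in refined.items()}
for name, p in penalties.items():
    print(name, round(p, 4))
```

With equal weights, Q2 has the smallest penalty of the three and Q3 by far the largest, because its ∆k of 9997 dominates the sum.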

Page 36: Answering Why-Not Questions on Top-K Queries

Normalized penalty function
