Linked Data Top-K Query Processing
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Institute of Applied Informatics and Formal Description Methods (AIFB) Top-k Linked Data Query Processing Andreas Wagner, Duc Thanh Tran, Günter Ladwig, Andreas Harth, and Rudi Studer

Upload: wagner-andreas

Post on 10-May-2015

DESCRIPTION

"Linked Data Top-K Query Processing" paper at ESWC'12.

TRANSCRIPT

Page 1: Linked Data Top-K Query Processing

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu

Institute of Applied Informatics and Formal Description Methods (AIFB)

Top-k Linked Data Query Processing

Andreas Wagner, Duc Thanh Tran, Günter Ladwig, Andreas Harth, and Rudi Studer

Page 2: Linked Data Top-K Query Processing

Introduction and Motivation

Top-k Linked Data Query Processing

Evaluation Results

Page 3: Linked Data Top-K Query Processing

INTRODUCTION & MOTIVATION

Page 4: Linked Data Top-K Query Processing

Linked Data Query Processing

Problems: Efficiency and Scalability

[Figure: a Linked Data query processing engine retrieves data from data sources (Src.), each identified by a URI, via HTTP lookup.]

Page 5: Linked Data Top-K Query Processing

Top-K Query Processing

Users are usually interested in only a few results

Top-K query processing addresses the efficiency and scalability issues

Src. 1: ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.

Src. 2: ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".

Src. 3: ex:help foaf:name "Help!"; ex:song "Help!".

Query: SELECT * WHERE { ex:beatles ex:album ?album . ?album ex:song ?song . }

Page 6: Linked Data Top-K Query Processing

Contributions

Transfer top-k query processing to the Linked Data setting

Linked Data specific improvements of the top-k approach

Evaluation using real-world data

Page 7: Linked Data Top-K Query Processing

TOP-K LINKED DATA QUERY PROCESSING

Page 8: Linked Data Top-K Query Processing

Top-K Query Processing in a Linked Data Setting (1) – Requirements (1)

Source index mapping triple patterns to sources containing bindings (e.g., [1,2])

Ranking function determining the relevance of triple pattern bindings

Src. 1 (score ∈ [0,1]): ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.

Src. 2 (score ∈ [1,2]): ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".

Src. 3 (score ∈ [2,3]): ex:help foaf:name "Help!"; ex:song "Help!".

TP1: ex:beatles ex:album ?album .

TP2: ?album ex:song ?song .

Source index: TP1 → Src. 1; TP2 → Src. 2, Src. 3
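The two requirements above can be illustrated with a minimal Python sketch. It assumes the three example sources are held in memory and builds the source index by matching each triple pattern against their contents; a real source index (e.g., [1,2]) would rely on precomputed summaries instead. The names `SOURCES`, `SCORE_RANGE`, `PATTERNS`, `matches`, and `build_source_index` are hypothetical and introduced only for illustration.

```python
# Hypothetical in-memory copies of the three example sources,
# each a list of (subject, predicate, object) triples.
SOURCES = {
    "src1": [
        ("ex:beatles", "foaf:name", "The Beatles"),
        ("ex:beatles", "ex:album", "ex:sgt_pepper"),
        ("ex:beatles", "ex:album", "ex:help"),
    ],
    "src2": [
        ("ex:sgt_pepper", "foaf:name", "Sgt. Pepper"),
        ("ex:sgt_pepper", "ex:song", "Lucy"),
    ],
    "src3": [
        ("ex:help", "foaf:name", "Help!"),
        ("ex:help", "ex:song", "Help!"),
    ],
}

# Score range (lower, upper) of the triples in each source, as in the example.
SCORE_RANGE = {"src1": (0, 1), "src2": (1, 2), "src3": (2, 3)}

PATTERNS = {
    "TP1": ("ex:beatles", "ex:album", "?album"),
    "TP2": ("?album", "ex:song", "?song"),
}


def matches(pattern, triple):
    """A pattern term matches if it is a variable (starts with '?') or equal."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def build_source_index(patterns, sources):
    """Map each triple pattern name to the sources containing bindings for it."""
    return {
        name: [
            src
            for src, triples in sources.items()
            if any(matches(pattern, t) for t in triples)
        ]
        for name, pattern in patterns.items()
    }


SOURCE_INDEX = build_source_index(PATTERNS, SOURCES)
```

In this sketch `SOURCE_INDEX` maps TP1 to `src1` and TP2 to `src2` and `src3`, and `SCORE_RANGE` plays the role of the per-source score information that the ranking function provides.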

Page 9: Linked Data Top-K Query Processing

Top-K Query Processing in a Linked Data Setting (2) – Requirements (2)

Sorted access on each join input

[Figure: a scheduling strategy decides which source to load next. Sorted access for TP1 (ex:beatles ex:album ?album) reads from Src. 1 (score ∈ [0,1]); sorted access for TP2 (?album ex:song ?song) reads from Src. 2 (score ∈ [1,2]) and Src. 3 (score ∈ [2,3]). Each sorted access delivers bindings with descending scores.]
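Sorted access over several sources could be sketched as follows. This is only one possible design, under two assumptions that the slides do not state: every source has a known score upper bound, and the scheduling strategy loads sources in descending order of that bound. A binding is emitted once no unloaded source could still contain a higher-scored one. The names `sorted_access`, `load`, and `TP2_BINDINGS` are hypothetical, and the score 2 for the "Lucy" triple is an assumed value within the range [1,2] of Src. 2.

```python
import heapq

# Hypothetical scored bindings for TP2 per source: (score, triple).
TP2_BINDINGS = {
    "src2": [(2, ("ex:sgt_pepper", "ex:song", "Lucy"))],  # score 2 is assumed
    "src3": [(3, ("ex:help", "ex:song", "Help!"))],
}


def sorted_access(sources, load):
    """Yield (score, triple) in descending score order.

    sources: list of (source_id, score_upper_bound)
    load:    function source_id -> list of (score, triple)
    """
    pending = sorted(sources, key=lambda s: s[1], reverse=True)
    heap = []  # max-heap emulated with negated scores
    i = 0
    while i < len(pending) or heap:
        # Load further sources while one of them could still beat the best buffered score.
        while i < len(pending) and (not heap or -heap[0][0] < pending[i][1]):
            for score, triple in load(pending[i][0]):
                heapq.heappush(heap, (-score, triple))
            i += 1
        if heap:
            neg_score, triple = heapq.heappop(heap)
            yield -neg_score, triple


# Usage: record which sources are actually loaded.
loaded = []


def load(source_id):
    loaded.append(source_id)
    return TP2_BINDINGS[source_id]


stream = sorted_access([("src2", 2), ("src3", 3)], load)
first = next(stream)
loaded_after_first = list(loaded)
rest = list(stream)
```

In this run the first binding is available after loading only `src3`; `src2` is fetched lazily when the consumer asks for more.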

Page 10: Linked Data Top-K Query Processing

Top-K Query Processing in a Linked Data Setting (3) – Push Bound Rank Join (1)

Scheduling Strategy: Load source 1

Src. 1: ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.

Sorted Access for ex:beatles ex:album ?album .
Score Seen Triples (TP1)
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help

Scheduling Strategy: Load source 3

Src. 3: ex:help foaf:name "Help!"; ex:song "Help!".

Sorted Access for ?album ex:song ?song
Score Seen Triples (TP2)
3 ex:help ex:song "Help!"

Score Query Bindings – Output Queue

Page 11: Linked Data Top-K Query Processing

Top-K Query Processing in a Linked Data Setting (4) – Push Bound Rank Join (2)

Sorted Access for ex:beatles ex:album ?album .
Score Seen Triples (TP1)
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help

Sorted Access for ?album ex:song ?song
Score Seen Triples (TP2)
3 ex:help ex:song "Help!"

Src. 2: not loaded

Score Query Bindings – Output Queue
4 ex:beatles ex:album ex:help . ex:help ex:song "Help!" .

Threshold: 4

Found query binding with score ≥ threshold: STOP
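The stopping rule can be sketched in Python as follows. This is a simplified, pull-based rank join that reads the two inputs alternately; it is not the push-based operator of the slides, and the alternating read order is an assumption. Scores are aggregated by summation, as in the example. The function name `rank_join`, the lists `TP1` and `TP2`, and the score 2 for the "Lucy" triple are hypothetical.

```python
from math import inf


def rank_join(left, right, k):
    """Join left objects with right subjects and return the top-k results.

    left, right: lists of (score, (s, p, o)) sorted by descending score.
    Returns (top-k list of (score, left_triple, right_triple), bindings read per input).
    """
    inputs = (left, right)
    pos = [0, 0]          # number of bindings read from each input
    seen = ([], [])
    results = []
    if not left or not right:
        return results, pos

    def top(i):
        # Highest score of input i (upper bound for its seen bindings).
        return inputs[i][0][0]

    def unseen_bound(i):
        # Upper bound for the unseen bindings of input i: the last seen score.
        if pos[i] >= len(inputs[i]):
            return -inf   # input exhausted, nothing unseen
        return inputs[i][max(pos[i] - 1, 0)][0]

    turn = 0
    while pos[0] < len(left) or pos[1] < len(right):
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn
        score, triple = inputs[turn][pos[turn]]
        pos[turn] += 1
        seen[turn].append((score, triple))
        # Join the new binding with everything seen on the other input.
        for other_score, other in seen[1 - turn]:
            l, r = (triple, other) if turn == 0 else (other, triple)
            if l[2] == r[0]:
                results.append((score + other_score, l, r))
        results.sort(key=lambda res: res[0], reverse=True)
        threshold = max(top(0) + unseen_bound(1), unseen_bound(0) + top(1))
        if len(results) >= k and results[k - 1][0] >= threshold:
            break         # no future result can beat the current top-k
        turn = 1 - turn
    return results[:k], pos


TP1 = [
    (1, ("ex:beatles", "ex:album", "ex:sgt_pepper")),
    (1, ("ex:beatles", "ex:album", "ex:help")),
]
TP2 = [
    (3, ("ex:help", "ex:song", "Help!")),
    (2, ("ex:sgt_pepper", "ex:song", "Lucy")),  # from Src. 2; score 2 is assumed
]
top1, reads = rank_join(TP1, TP2, k=1)
```

For k = 1 the sketch stops after reading two TP1 bindings and one TP2 binding: the result with score 4 already meets the threshold of 4, so the binding from Src. 2 is never read.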

Page 12: Linked Data Top-K Query Processing

Improving the Threshold Estimation (1)

Threshold estimation: for each join input i (TP1, TP2), max_i is the highest score among its seen triples (upper bound for seen bindings) and min_i is the lowest score among its seen triples (upper bound for unseen bindings).

Threshold: max { max_1 + min_2 , max_2 + min_1 }

We improve the threshold estimation:

Star-shaped entity query bounds

Look-ahead bounds

Page 13: Linked Data Top-K Query Processing

Improving the Threshold Estimation (2) – Star-shaped Entity Query Bounds

Observation: Results for entity queries come from one single source

Idea: Upper-bound the scores of triple pattern bindings via the maximal possible triple score

Src. 2 (score ∈ [1,2]): ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".

Src. 3 (score ∈ [2,3]): ex:help foaf:name "Help!"; ex:song "Help!".

Star-shaped entity query over ?x, ?y, ?z with the predicates foaf:name and ex:song

Upper bound for triple bindings (foaf:name): 3

Upper bound for triple bindings (ex:song): 3

Upper bound for entity query bindings: 3 + 3
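The entity query bound can be reproduced with a short Python sketch. It assumes that all bindings of a star-shaped entity query come from one source, so the bound for a source is the number of triple patterns times that source's maximal triple score. The function name `entity_query_bounds` is hypothetical, and the bound computed for Src. 2 does not appear on the slide; it follows from the same rule applied to the range [1,2].

```python
def entity_query_bounds(source_max_score, num_patterns):
    """Per-source upper bound for the score of an entity query binding.

    source_max_score: source id -> maximal possible triple score in that source
    num_patterns:     number of triple patterns in the star-shaped query
    """
    return {src: num_patterns * s for src, s in source_max_score.items()}


# Two patterns (foaf:name, ex:song); maximal scores from the ranges [1,2] and [2,3].
per_source = entity_query_bounds({"src2": 2, "src3": 3}, num_patterns=2)
overall = max(per_source.values())
```

Here `overall` equals the 3 + 3 bound shown for the example.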

Page 14: Linked Data Top-K Query Processing

Improving the Threshold Estimation (3) – Look-ahead Bounds

Idea: Provide a more accurate upper bound for the scores of unseen bindings via the "next possible" score

Sorted Access for ex:beatles ex:album ?album .
Score Seen Triples (TP1)
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help

Sorted Access for ?album ex:song ?song (Src. 3 loaded; Src. 2 remaining, score ∈ [1,2])
Score Seen Triples (TP2)
3 ex:help ex:song "Help!"

Score Query Bindings – Output Queue
4 ex:beatles ex:album ex:help . ex:help ex:song "Help!" .

Without look-ahead: max_1 = 1, min_1 = 1, max_2 = 3, min_2 = 3

Threshold: max { 1 + 3 , 1 + 3 } = 4

With look-ahead: min_2 = 2

Threshold: max { 1 + 2 , 1 + 3 } = 4
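The two threshold computations can be checked with a small Python sketch. It assumes that the "next possible" score of an input is the highest score upper bound among its not yet loaded sources, capped by the last seen score, and that the bound is left unchanged when no source remains. The function names `threshold` and `look_ahead` are hypothetical.

```python
def threshold(max_1, min_1, max_2, min_2):
    """Threshold: max { max_1 + min_2 , max_2 + min_1 }."""
    return max(max_1 + min_2, max_2 + min_1)


def look_ahead(last_seen, unloaded_upper_bounds):
    """Tighter bound for unseen bindings via the next possible score.

    Assumption: without unloaded sources the last seen score is kept as is.
    """
    if not unloaded_upper_bounds:
        return last_seen
    return min(last_seen, max(unloaded_upper_bounds))


# Values from the example: TP1 has seen scores 1 and 1, TP2 has seen score 3.
standard = threshold(max_1=1, min_1=1, max_2=3, min_2=3)

# Src. 2 is the only unloaded source for TP2; its scores lie in [1,2].
min_2 = look_ahead(last_seen=3, unloaded_upper_bounds=[2])
improved = threshold(max_1=1, min_1=1, max_2=3, min_2=min_2)
```

In this example the look-ahead bound lowers the first term from 1 + 3 to 1 + 2, but the second term still yields a threshold of 4.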

Page 15: Linked Data Top-K Query Processing

EVALUATION

Page 16: Linked Data Top-K Query Processing

Evaluation – Setting

We implemented three systems:

Push-based symmetric hash join operator [2,5]

Standard top-k operator [6]

Improved top-k operator

Query set: 20 queries (8 FedBench and 12 own queries), with varying result size (1 to ~10,000) and complexity (2 to 5 triple patterns)

Data set: ~2,000,000 triples, distributed over ~700,000 sources

Parameters: k ∈ {1, 5, 10, 20} and score distributions ∈ {uniform, normal, exponential}

Page 17: Linked Data Top-K Query Processing

Evaluation – Results (1)

Overall Results

Top-k strategies lead to a runtime improvement of 35% on average (compared to standard Linked Data processing)

Tighter bounding leads to further improvements of 12% on average (compared to standard top-k processing)

[Figure: overview of processing times for all queries (k = 1, d = n)]

Page 18: Linked Data Top-K Query Processing

Evaluation – Results (2)

Effect of K and Score Distributions

Page 19: Linked Data Top-K Query Processing

CONCLUSION

Page 20: Linked Data Top-K Query Processing

Conclusion

We showed that top-k processing techniques are applicable to the Linked Data setting.

Top-k strategies lead to significant time savings for small values of k (in our experiments 35% on average)

We showed that our improved top-k strategy leads to further runtime advantages (in our experiments 12% on average)

Page 21: Linked Data Top-K Query Processing

QUESTIONS

Page 22: Linked Data Top-K Query Processing

REFERENCES

Page 23: Linked Data Top-K Query Processing

References

[1] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In World Wide Web, 2010.

[2] G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC, 2010.

[3] M. Wu, L. Berti-Equille, A. Marian, C. M. Procopiuc, and D. Srivastava. Processing top-k join queries. Proc. VLDB Endow., pages 860–870, 2010.

[4] A. Harth, S. Kinsella, and S. Decker. Using naming authority to rank data and ontologies for web search. In ISWC, pages 277–292, 2009.

[5] G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In ESWC, 2011.

[6] K. Schnaitter and N. Polyzotis. Optimal algorithms for evaluating rank joins in database systems. ACM Trans. Database Syst., 35:6:1–6:47, 2010.

Page 24: Linked Data Top-K Query Processing

BACKUP SLIDES

Page 25: Linked Data Top-K Query Processing

Early Pruning of Partial Results

Motivation: Top-k join processing can be quite costly in terms of memory consumption

Idea: Prune partial query results that cannot contribute to a final top-k result

Star-shaped entity query over ?x, ?y, ?z with the predicates ex:song and foaf:name

Upper bound for triple bindings: 3

Currently known top-2 results:
Rank Query Bindings – Output Queue
6 ex:help foaf:name "Help!". ex:help ex:song "Help!" .
4 ex:sgt_pepper foaf:name "Sgt. Pepper". ex:sgt_pepper ex:song "Lucy".

Currently known partial results:
Rank Triple Pattern Binding
1 ex:sgt_pepper ex:song "Getting Better".

Maximal score: 3 + 1 = 4
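The pruning test could be sketched in Python as follows. It assumes that a partial result may be dropped when its best possible final score does not exceed the lowest score among the current top-k results; treating a tie as prunable is an assumption that the slide does not state explicitly. The function name `can_prune` and its parameters are hypothetical.

```python
def can_prune(partial_score, remaining_upper_bounds, top_k_scores, k):
    """Return True if a partial result cannot improve the current top-k.

    partial_score:          score of the bindings joined so far
    remaining_upper_bounds: upper bounds for the triple patterns still unbound
    top_k_scores:           scores of the currently known complete results
    """
    if len(top_k_scores) < k:
        return False  # fewer than k results known, every candidate may still be needed
    best_possible = partial_score + sum(remaining_upper_bounds)
    kth_score = sorted(top_k_scores, reverse=True)[k - 1]
    # Assumption: a candidate that can at most tie the k-th result is pruned.
    return best_possible <= kth_score


# Example values: partial result with score 1, one unbound pattern bounded by 3,
# and the current top-2 scores 6 and 4.
prune_example = can_prune(1, [3], [6, 4], k=2)
```

With the example values the best possible score is 3 + 1 = 4, which does not exceed the second-best known score of 4, so the partial result is discarded under the stated tie-handling assumption.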