Linked Data Query Processing Strategies

Günter Ladwig, Thanh Tran
International Semantic Web Conference 2010, Shanghai

Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association
www.kit.edu


Institute of Applied Informatics and Formal Description Methods (AIFB), November 11th, 2010

Contents

Introduction
  Challenges
  Contributions

Linked Data Query Processing Strategies

Stream-based Query Processing

Corrective Source Ranking

Evaluation

Conclusion

ISWC 2010, Shanghai, China


What is Linked Data?

Linked Data Principles
  Use URIs to identify things
  Use HTTP URIs that allow dereferencing
  Dereferencing a URI provides information about the thing in a standard format (RDF)
  Include links to other, related URIs

Linked Data Query Processing
  Evaluate queries directly over Linked Data
  Dereference Linked Data URIs during query processing
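
Dereferencing can be illustrated with a minimal Python sketch that prepares, but does not send, the HTTP request for a Linked Data URI. The function name, the `#me` fragment on the example URI and the listed RDF media types are assumptions made for illustration; the slides only state that an RDF format is returned.

```python
from urllib.parse import urldefrag
from urllib.request import Request

# Assumed media types; the slides only say "a standard format (RDF)".
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9"

def build_dereference_request(uri: str) -> Request:
    """Prepare (but do not send) an HTTP GET request for a Linked Data URI."""
    # A fragment such as "#me" is not sent to the server, so the source
    # document is identified by the URI without its fragment.
    source_url, _fragment = urldefrag(uri)
    return Request(source_url, headers={"Accept": RDF_ACCEPT})

# Hypothetical URI, modeled on the example sources used in the slides.
request = build_dereference_request("http://sw.org/person/AB#me")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup; that step is left out so the sketch runs without network access.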


Challenges

Volume of the Source Collection
  Each URI is a potential data source

Dynamics of the Source Collection
  Sources may change rapidly over time
  Sources might only be discovered at run-time

Heterogeneity of Sources, Source Descriptions and Access Methods
  Sources vary in size
  Descriptions of sources vary in completeness
  Access methods: URI lookup, SPARQL endpoints, local cache, ...


Contributions

Discussion of Linked Data Query Processing strategies

Mixed strategy, combining local indexes and run-time discovery

Stream-based Query Processing
  Data can arrive at any time and in any order
  Suited to deal with network latency

Corrective Source Ranking
  Deals with different types of source descriptions
  Ranking is refined at run-time


LINKED DATA QUERY PROCESSING STRATEGIES


Top-down Query Evaluation

Local index, assumed to be complete
  Selection and ranking of sources
  No run-time discovery

Fast, only relevant sources are retrieved

Not up-to-date, index size may become very large

Steps: Select and rank sources, Retrieve sources, Join data

Example query:

SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}

Probe of the local source index:

Source URI                 Score
http://sw.org/person/AB    0.87
...                        ...
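
The index probe can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the index layout (predicate to per-source cardinality), the cardinality values and the scoring by normalized summed cardinality are all hypothetical.

```python
from collections import defaultdict

# Hypothetical local source index: predicate -> {source URI: cardinality}.
SOURCE_INDEX = {
    "swrc:author": {"http://sw.org/person/AB": 12, "http://sw.org/proc/eswc/2010": 40},
    "swc:isPartOf": {"http://sw.org/proc/eswc/2010": 40},
    "swc:relatedToEvent": {"http://sw.org/proc/eswc/2010": 1},
}

def select_and_rank(query_predicates, index):
    """Return (source, score) pairs, best first.

    Assumed scoring: summed cardinality over all query predicates,
    normalized by the best source so that scores fall in (0, 1].
    """
    totals = defaultdict(int)
    for predicate in query_predicates:
        for source, cardinality in index.get(predicate, {}).items():
            totals[source] += cardinality
    best = max(totals.values(), default=1)
    ranked = [(source, total / best) for source, total in totals.items()]
    # Sort by descending score, then by URI for a stable order.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

ranking = select_and_rank(
    ["swrc:author", "swc:isPartOf", "swc:relatedToEvent"], SOURCE_INDEX
)
for source, score in ranking:
    print(f"{source}  {score:.2f}")
```

Only sources returned by the probe are retrieved, which is why the top-down strategy never contacts a source that the index does not know.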


Bottom-up Query Evaluation

Sources are discovered at run-time through links

Answers can be incomplete as links might not be discoverable

Slower, as unnecessary sources are retrieved

Always up-to-date

Example query:

SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}

Retrieve source:

  <http://sw.org/proc/eswc/2010> swc:relatedToEvent <http://sw.org/eswc/2010> .
  ...

Discover new sources:

  swc:paper1 swc:isPartOf <http://sw.org/proc/eswc/2010> .
  ...
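
A minimal Python sketch of this link traversal is given below. It assumes an in-memory dictionary standing in for the Web and a plain breadth-first order; the paper and person URIs are hypothetical extensions of the example on the slide.

```python
from collections import deque

# Hypothetical stand-in for the Web: URI -> triples returned on lookup.
WEB = {
    "http://sw.org/eswc/2010": [
        ("http://sw.org/proc/eswc/2010", "swc:relatedToEvent", "http://sw.org/eswc/2010"),
    ],
    "http://sw.org/proc/eswc/2010": [
        ("http://sw.org/paper1", "swc:isPartOf", "http://sw.org/proc/eswc/2010"),
    ],
    "http://sw.org/paper1": [
        ("http://sw.org/paper1", "swrc:author", "http://sw.org/person/AB"),
    ],
}

def traverse(seed_uris, lookup, max_sources=100):
    """Breadth-first link traversal starting from the URIs in the query."""
    queue, seen = deque(seed_uris), set(seed_uris)
    retrieved, triples = [], []
    while queue and len(retrieved) < max_sources:
        uri = queue.popleft()
        retrieved.append(uri)
        for s, p, o in lookup(uri):
            triples.append((s, p, o))
            # Every HTTP URI in subject or object position is a potential source.
            for term in (s, o):
                if term.startswith("http://") and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return retrieved, triples

retrieved, triples = traverse(["http://sw.org/eswc/2010"], lambda uri: WEB.get(uri, []))
print(retrieved)
```

The traversal also looks up `http://sw.org/person/AB`, which returns nothing here; this is the kind of unnecessary retrieval that makes the bottom-up strategy slower.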


Mixed Strategy

Combination of top-down and bottom-up strategies
  Partial local index of sources, not assumed to be complete
  New sources are discovered at run-time
  Addresses volume and dynamics of Linked Data

Corrective Source Ranking
  Deal with heterogeneous source descriptions

Stream-based Query Processing
  Deal with unpredictable nature of Linked Data access


STREAM-BASED QUERY PROCESSING


Stream-based Query Processing

Network latency
  Do not block!
  Evaluation driven by incoming data

Compile-time
  Construct query plan
  Probe local index for sources

Run-time
  Rank sources
  Retrieve sources
  Push data into query plan
  Discover new sources

Figure (Query Plan): Join operators over the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n), producing Results.

Figure (Source Retrieval): Local source index; Source Ranker holding Source 1 (score: 1.0), Source 2 (score: 0.7), ...; Source Retriever 1, Source Retriever 2, ...; edges labeled Retrieve source, Source discovered, Push and Samples.
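
The run-time steps can be sketched in Python as a loop over a priority queue of sources. This is a sequential simplification: a real system would retrieve sources concurrently so that operators never block on the network. The function name, the fixed score given to discovered sources and the example.org data are assumptions for illustration.

```python
import heapq

def process(seed_scores, lookup, push, score_discovered=0.1, max_sources=50):
    """Retrieve sources best-first, push their triples into the query plan,
    and enqueue newly discovered sources with an assumed default score."""
    # heapq is a min-heap, so scores are negated to pop the best source first.
    heap = [(-score, uri) for uri, score in seed_scores.items()]
    heapq.heapify(heap)
    seen = set(seed_scores)
    order = []
    while heap and len(order) < max_sources:
        _, uri = heapq.heappop(heap)
        order.append(uri)
        for triple in lookup(uri):
            push(triple)  # data-driven evaluation: the plan reacts to each triple
            for term in (triple[0], triple[2]):
                if term.startswith("http://") and term not in seen:
                    seen.add(term)
                    heapq.heappush(heap, (-score_discovered, term))
    return order

# Hypothetical data: two indexed sources, one source reachable only via a link.
web = {
    "http://example.org/a": [("http://example.org/a", "knows", "http://example.org/c")],
    "http://example.org/b": [("http://example.org/b", "name", "B")],
    "http://example.org/c": [("http://example.org/c", "name", "C")],
}
pushed = []
order = process(
    {"http://example.org/a": 1.0, "http://example.org/b": 0.7},
    lambda uri: web.get(uri, []),
    pushed.append,
)
print(order)
```

Here the discovered source is retrieved last because its assumed default score is lower than the scores of the indexed sources.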


Push-based Symmetric Hash Join

Operation
  Maintains a hash table for each input
  Arriving tuples are inserted into one hash table and then the other is probed for join combinations

Push-based
  Tuples are pushed into operators from the leaves to the root of the query plan
  Execution is driven by incoming tuples rather than by requests for results
  Results reported as soon as input tuples arrive
  Tuples can arrive on all inputs in any order

Example: tuple t7 with key b is pushed on the left input

Left input (before)    Right input
Key  T                 Key  T
a    t1, t3            b    t4, t5
b    t2                c    t6

Insert: t7 is added to the left hash table under key b

Left input (after)
Key  T
a    t1, t3
b    t2, t7

Probe: the right hash table is probed with key b, which matches t4 and t5

Push output: t7t4, t7t5
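
The operator can be sketched in Python as below and replayed on the example from the slide. The class and method names are illustrative, and the `emit` callback stands in for the parent operator in the query plan.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Push-based symmetric hash join on a single join key (a sketch)."""

    def __init__(self, emit):
        # One hash table per input, mapping join key -> list of tuples.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.emit = emit  # callback standing in for the parent operator

    def push(self, side, key, tuple_):
        other = "right" if side == "left" else "left"
        # Insert into the hash table of the input the tuple arrived on ...
        self.tables[side][key].append(tuple_)
        # ... then probe the other table and push each combination upward.
        for match in self.tables[other].get(key, []):
            pair = (tuple_, match) if side == "left" else (match, tuple_)
            self.emit(pair)

results = []
join = SymmetricHashJoin(results.append)

# Build the state shown on the slide; tuples may arrive in any order.
for key, t in [("a", "t1"), ("b", "t2"), ("a", "t3")]:
    join.push("left", key, t)
for key, t in [("b", "t4"), ("b", "t5"), ("c", "t6")]:
    join.push("right", key, t)

results.clear()            # drop combinations produced while building the state
join.push("left", "b", "t7")
print(results)             # [('t7', 't4'), ('t7', 't5')]
```

Because both inputs keep a hash table, neither side has to be complete before results are produced, which is what makes the operator non-blocking.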


CORRECTIVE SOURCE RANKING


Corrective Source Ranking

Prefer more relevant sources

Relevancy of a source is based on
  Current query
  Any available intermediate results
  Overall optimization goal

Define a set of source features and derive concrete source metrics

Not all metrics are available for all sources (heterogeneity)

Refine previously computed metrics using newly discovered information


Source Features and Metrics

Source is more relevant if it contains data that contributes to answers of the query

Triple Pattern Cardinality

Join Pattern Cardinality

Cardinalities stored in local index

Some patterns have high cardinality for all or many sources (e.g. )

These patterns do not discriminate sources


Source Features and Metrics

Adopt TF-IDF concept to obtain weights for triple patterns
  Importance positively correlates with how often bindings to a pattern occur in a source (i.e. cardinality)
  Importance negatively correlates with how often its bindings occur in all sources of the source collection S

Triple Frequency – Inverse Source Frequency (TF-ISF)
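
The exact TF-ISF formula is not preserved in this transcript. The Python sketch below therefore assumes the standard TF-IDF form, with the cardinality of the pattern in a source as the frequency term and the logarithm of |S| divided by the number of sources containing the pattern as the inverse source frequency. The patterns and cardinalities are hypothetical.

```python
import math

def tf_isf(cardinalities, pattern, source):
    """Assumed TF-IDF-style weight of a triple pattern for one source.

    cardinalities: source -> {pattern: cardinality}, covering the collection S.
    """
    tf = cardinalities[source].get(pattern, 0)
    containing = sum(1 for cards in cardinalities.values() if cards.get(pattern, 0) > 0)
    if tf == 0 or containing == 0:
        return 0.0
    return tf * math.log(len(cardinalities) / containing)

# Hypothetical cardinalities for a collection of three sources.
cards = {
    "s1": {"?x rdf:type ?y": 50, "?x swc:isPartOf ?y": 40},
    "s2": {"?x rdf:type ?y": 70},
    "s3": {"?x rdf:type ?y": 10},
}
common = tf_isf(cards, "?x rdf:type ?y", "s1")     # pattern occurs in every source
rare = tf_isf(cards, "?x swc:isPartOf ?y", "s1")   # pattern occurs in one source only
print(common, rare)
```

Under this assumed form, a pattern that occurs in every source receives weight zero regardless of its cardinality, so it no longer influences the ranking.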


Source Features and Metrics - Links

Source linked from many other sources is more relevant

Relevance is higher when these links match query predicates

Links are only discovered at run-time
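
A minimal Python sketch of a weighted link score follows. The slides do not give the weighting, so the two weights, the function name and the example links are assumptions.

```python
def link_score(inbound_links, query_predicates, match_weight=2.0, base_weight=1.0):
    """Score a source by the links pointing to it that were discovered so far.

    inbound_links: list of (linking source, predicate) pairs.
    Links whose predicate appears in the query count more (assumed weights).
    """
    return sum(
        match_weight if predicate in query_predicates else base_weight
        for _source, predicate in inbound_links
    )

# Hypothetical links discovered at run-time.
links = [("s1", "swc:isPartOf"), ("s2", "rdfs:seeAlso"), ("s3", "swc:isPartOf")]
score = link_score(links, {"swc:isPartOf", "swrc:author"})
print(score)  # 5.0
```

Since links only become known as sources are retrieved, such a score has to be recomputed whenever new links to a source are discovered.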


Metric Correction and Refinement

During query processing new information becomes available: intermediate join results, links

Refine and correct previously computed metrics

Important in the case of non-discriminative patterns

Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates

Example: sample the intermediate results in the SHJ operator, then perform triple pattern cardinality lookups for the sampled bindings
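
The sampling step can be sketched in Python as follows. The estimator, which scales the average cardinality of the sample to the full set of intermediate results, is an assumption; the slides state only that samples are used to obtain better join size estimates. The bindings and cardinalities are hypothetical.

```python
import random

def estimate_join_size(intermediate_bindings, lookup_cardinality, sample_size, rng=random):
    """Estimate the join size for one source from a sample of intermediate results."""
    if not intermediate_bindings:
        return 0.0
    k = min(sample_size, len(intermediate_bindings))
    sample = rng.sample(intermediate_bindings, k)
    # Instantiate the triple pattern with each sampled binding and
    # look up the cardinality of the instantiated pattern in the index.
    sampled_total = sum(lookup_cardinality(binding) for binding in sample)
    # Scale the sample average up to all intermediate results.
    return sampled_total / k * len(intermediate_bindings)

# Hypothetical bindings for ?paper and per-binding cardinalities in one source.
bindings = ["p1", "p2", "p3", "p4"]
index = {"p1": 2, "p2": 0, "p3": 4, "p4": 2}

exact = estimate_join_size(bindings, index.get, sample_size=4)
rough = estimate_join_size(bindings, index.get, sample_size=2, rng=random.Random(0))
print(exact, rough)
```

With the sample covering all bindings the estimate equals the true join size; smaller samples are cheaper but can over- or underestimate it.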


Ranking at Run-time

Optimization goal: early result reporting
  Indexed sources: triple and join pattern cardinality, TF-ISF, weighted links, sampled join size estimates
  Discovered sources: weighted links

Ranking has to be refined at run-time

Parameters influencing behavior and cost of ranking process
  Invalid Score Threshold: ranking is performed when the number of sources with invalid scores passes a threshold
  Sample Size: larger samples for join size estimation give better estimates but are also more costly
  Resampling Threshold: cache join size estimates and perform sampling only when the hash table of join operator grows past a given threshold
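
The three parameters can be grouped in a small Python sketch. The class name, the parameter values and the reading of "grows past a given threshold" as growth since the last sampling are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RankingPolicy:
    invalid_score_threshold: int  # re-rank once more scores than this are invalid
    sample_size: int              # bindings sampled per join size estimate
    resampling_threshold: int     # hash table growth that triggers new sampling

    def should_rerank(self, invalid_scores: int) -> bool:
        """Ranking is performed only when enough scores have become invalid."""
        return invalid_scores > self.invalid_score_threshold

    def should_resample(self, table_size: int, size_at_last_sample: int) -> bool:
        """Cached estimates are reused until the hash table has grown enough
        (assumed here to mean growth since the last sampling)."""
        return table_size - size_at_last_sample > self.resampling_threshold

# Hypothetical parameter values.
policy = RankingPolicy(invalid_score_threshold=10, sample_size=50, resampling_threshold=100)
print(policy.should_rerank(11), policy.should_resample(250, 100))
```

Higher thresholds reduce the cost of ranking at the price of working with staler scores and estimates.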


EVALUATION


Evaluation

Systems: top-down (TD), bottom-up (BU), mixed (MI)

8 queries over various datasets (DBpedia, Geonames, NYT, Freebase, ...)

To make the approaches comparable, sources were restricted to those discoverable by the BU approach

~6200 sources, containing ~500k triples

Sources hosted on local proxy server with artificial delay of 2 seconds

25% of sources were randomly chosen to construct index for MI


Results

Overall early result reporting
  25% results: MI 8.7s, BU 15.1s
  50% results: MI 12.8s, BU 22.0s
  Improvement of ~42%

Detailed results for two queries:

                 Query 1                         Query 6
                 BU        MI        TD          BU        MI        TD
25% Results      24810.5   10300.0   11038.0     8222.5    4743.5    5545.0
50% Results      43464.5   40782.0   15787.0     10961.5   7650.5    5634.0
Total            84066.5   86895.5   44323.5     24086.0   20711.0   16469.0
Src. Selection   0.0       853.0     1444.5      0.0       1331.0    1863.5
Ranking          25.5      2404.0    411.5       23.5      292.5     335.0
#Sources         622       612       154         236       92        49
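
The ~42% figure can be checked against the overall early result reporting times for MI and BU. The Python sketch below assumes the improvement is the relative reduction in time of MI compared to BU; the slides do not state how the figure was computed.

```python
def improvement(baseline_s: float, improved_s: float) -> float:
    """Relative reduction in time, as a percentage of the baseline."""
    return (baseline_s - improved_s) / baseline_s * 100

at_25 = improvement(15.1, 8.7)    # BU vs. MI for 25% of results
at_50 = improvement(22.0, 12.8)   # BU vs. MI for 50% of results
print(f"{at_25:.1f}% {at_50:.1f}%")  # 42.4% 41.8%
```

Both values round to 42%, which is consistent with the reported improvement.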


Result Arrival Times


Ranking Heuristics


Conclusion

Mixed strategy for Linked Data Query Processing
  Partial knowledge available beforehand, combined with source discovery at run-time

Corrective Source Ranking
  Metrics for source relevancy
  Refinement of ranking at run-time

Stream-based Query Processing

Early results reported on average 42% faster

Future work
  Adapt query plan to changing properties of incoming data
  Query local and remote data
