1 efficient sparql query processing in mapreduce through data partitioning and indexing nie zhi...

1

Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing

Nie [email protected]

2

Outline

– Introduction
– Related work
– SPARQL Query Processing in MapReduce
– Experiments
– Conclusion


4

RDF

Resource Description Framework (RDF)
– Data is expressed as subject-predicate-object (S-P-O) expressions

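[Figure: RDF graph in the namespace http://www.mpii.de/yago/resource/. The subject (S) Albert Einstein is linked by the predicates (P) isCalled, hasWonPrize and wasBornIn to the objects (O) "Albert Einstein" and "阿尔伯特•爱因斯坦", Nobel Prize in Physics, and Ulm.]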

5

SPARQL Query Language for RDF

Query:

PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where
}



[Figure: the query pattern matched against the RDF graph in http://www.mpii.de/yago/resource/; ?who is connected through hasWonPrize, isCalled and wasBornIn.]

Result:

name | where
Albert Einstein | Ulm
阿尔伯特•爱因斯坦 | Ulm

6

RDF knowledge base…

Semantic Web, Web 2.0

Extract knowledge from the Web
– YAGO
– DBpedia
– Freebase
– Billion Triple Challenge
– …

7

RDF knowledge base

– 295 data sets
– 31 billion RDF triples
– 504 million RDF links

(September 2011)

8

Challenge and Opportunity

Challenge
– RDF data is growing rapidly; researchers are now working with billions of triples.
– Relational databases have limited scalability.

Opportunity
– Google GFS, MapReduce, BigTable
– Hadoop: an implementation of the MapReduce framework, plus HDFS
– Achievements: Yahoo!, Amazon, Tencent, Baidu, Taobao, ...

We need to build on these recent achievements in handling massive-scale Web data on clusters.

9

MapReduce: word count

Input
– file1: the weather is good
– file2: today is good
– file3: good weather is good.

Map output
– Worker 1: (the 1), (weather 1), (is 1), (good 1)
– Worker 2: (today 1), (is 1), (good 1)
– Worker 3: (good 1), (weather 1), (is 1), (good 1)

Reduce input
– Worker 1: (the 1)
– Worker 2: (is 1), (is 1), (is 1)
– Worker 3: (weather 1), (weather 1)
– Worker 4: (today 1)
– Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce output
– Worker 1: (the 1)
– Worker 2: (is 3)
– Worker 3: (weather 2)
– Worker 4: (today 1)
– Worker 5: (good 4)

Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(k3, v3)
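The word count above can be simulated in a single process. The following is a minimal sketch, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and the shuffle step that the framework performs between map and reduce is written out by hand.

```python
from collections import defaultdict
import string

def map_phase(text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in one file."""
    pairs = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)  # "good." counts as "good"
        if word:
            pairs.append((word, 1))
    return pairs

def shuffle(mapped_outputs):
    """Group all emitted values by key (done by the framework in real MapReduce)."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, one in pairs:
            groups[word].append(one)
    return groups

def reduce_phase(word, counts):
    """Reduce(k2, list(v2)) -> (k3, v3): sum the values collected for one key."""
    return word, sum(counts)

files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good.",
}

mapped = [map_phase(text) for text in files.values()]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)
```

Each key of `grouped` plays the role of one reduce worker's input on the slide.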


11

Solution 1

Directly map the SPARQL query to a sequence of MapReduce jobs

Pro.
– Scalable

Con.
– A burden on the user in terms of usage and maintenance
– No support for complex queries
– No index
– Does not consider the characteristics of RDF data

12

Solution 2

Map the SPARQL query to Pig, which compiles to MapReduce jobs

Pro.
– Scalable
– Supports complex queries

Con.
– No index
– Does not consider the characteristics of RDF data


14

Architecture overview

[Figure: architecture diagram with the following components]
– SPARQL Translator: BGP, Union, Filter, Optional
– RDF 2 JSON Loader
– Optimizer
– JAQL Query Language: Transform, Filter, Join, Sort, Group, Built-in Functions
– JSON Data Model
– Map-Reduce Runtime
– HDFS
– Cluster Deployment and Management

15

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format

It is based on a subset of the JavaScript Programming Language

JSON is built on two structures:
– A collection of name/value (key/value) pairs
– An ordered list of values (array)

16

RDF to JSON

RDF triples

Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18

JSON format

[{s:Albert Einstein, p:isCalled, o:Albert Einstein},
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦},
 {s:Albert Einstein, p:wasBornIn, o:Ulm},
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14},
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18}]

JSON is built on two structures:
– Name/value (key/value) pairs: {s:Albert Einstein}
– List of values (array): [{s:Albert Einstein},{}…]
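The mapping from triples to JSON records can be sketched as follows. This is an illustration only, not the deck's RDF 2 JSON loader; the function name `triples_to_json` is hypothetical, and the keys `s`, `p`, `o` follow the slide.

```python
import json

def triples_to_json(triples):
    """Turn (subject, predicate, object) tuples into a list of {s, p, o} records."""
    return [{"s": s, "p": p, "o": o} for s, p, o in triples]

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
]

records = triples_to_json(triples)
# ensure_ascii=False keeps non-ASCII literals readable in the output file.
encoded = json.dumps(records, ensure_ascii=False, indent=2)
print(encoded)
```

The outer array is JSON's "ordered list of values", and each record is a collection of name/value pairs.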

17

JAQL

JAQL is an open-source language for querying JSON (JavaScript Object Notation) data.

It provides a general parallel data processing platform on Hadoop

Developed by IBM

18

Basic Idea

SPARQL can be supported on Hadoop by translating queries into JAQL operators

JAQL operators:
– Filter
– Transform
– Join
– Group
– Sort
– Built-in functions: merge(d1, d2), regex(), etc.

19

SPARQL to JAQL Transformation

SPARQL Query

PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where .
}

JAQL Query

// read files from HDFS by predicate name
$1 = read(hdfs('source:hasWonPrize'))
     -> filter $.o == "Nobel Prize in Physics"   // select
     -> transform {$.s};                          // project
$2 = read(hdfs('source:isCalled')) -> transform {$.s, $.o};
$3 = read(hdfs('source:wasBornIn')) -> transform {$.s, $.o};
// multi-way join
join $1, $2, $3
  where $1.s == $2.s and $2.s == $3.s
  into { name: $2.o, where: $3.o };               // project to ?name ?where

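[Figure annotations: markers 1 to 3 on the SPARQL query and 1 to 4 on the JAQL query; the JAQL query runs as MapReduce job1, MapReduce job2, MapReduce job3 and MapReduce job4. Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein}]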
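The filter, transform and join steps of the translated query can be mimicked on in-memory lists. This is a hedged sketch in Python rather than JAQL: the `partitions` dictionary stands in for the per-predicate HDFS files, and the helper names `read` and `index_by_subject` are illustrative assumptions.

```python
from collections import defaultdict

# Stand-in for HDFS: one list of {s, o} records per predicate file.
partitions = {
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
}

def read(predicate):
    """Read all records stored under one predicate name."""
    return partitions.get(predicate, [])

def index_by_subject(records):
    """Group records by subject so the join can look them up directly."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["s"]].append(record)
    return grouped

# $1: filter (select) on the object, then transform (project) to the subject.
winners = [{"s": r["s"]} for r in read("hasWonPrize")
           if r["o"] == "Nobel Prize in Physics"]
# $2 and $3: transform (project) to subject and object.
names = [{"s": r["s"], "o": r["o"]} for r in read("isCalled")]
places = [{"s": r["s"], "o": r["o"]} for r in read("wasBornIn")]

# Multi-way join on the shared subject, projected to ?name and ?where.
names_by_s = index_by_subject(names)
places_by_s = index_by_subject(places)
results = [
    {"name": n["o"], "where": p["o"]}
    for w in winners
    for n in names_by_s.get(w["s"], [])
    for p in places_by_s.get(w["s"], [])
]
print(results)
```

Charles K. Kao passes the filter but has no isCalled record in this sample, so the join drops him.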

20

Data storage

In the Hadoop framework
– A file is the smallest unit of input to a MapReduce job, and it is read from disk.

One straightforward strategy is to store all the data in one file
– Every read operation must then scan the entire data set.

A data partitioning strategy is therefore needed.

21

Data Partitioning Strategy

– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning

22

Horizontal partitioning with JSON

For example

Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing

Store in HDFS

File 1 File name: Hash(Subject1)

[{s:Albert Einstein, p:isCalled, o:Albert Einstein},
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦},
 {s:Albert Einstein, p:wasBornIn, o:Ulm},
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14},
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18}]

File 2 File name: Hash(Subject2)

[{s:Charles K. Kao, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Charles K. Kao, p:wasBornIn, o:Shanghai}]

File 3 File name: Hash(Subject3)

[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards},
 {s:Faye Wong, p:wasBornIn, o:Beijing}]
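Horizontal partitioning can be sketched as grouping triples by a hash of their subject. The slides only name the file `Hash(Subject)`; the choice of MD5 and the function names `file_name_for` and `horizontal_partition` are assumptions made for this illustration.

```python
import hashlib
from collections import defaultdict

def file_name_for(subject):
    """Derive a file name from the subject (MD5 is an assumed hash function)."""
    return hashlib.md5(subject.encode("utf-8")).hexdigest()

def horizontal_partition(triples):
    """Place all triples that share a subject in the same file."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[file_name_for(s)].append({"s": s, "p": p, "o": o})
    return dict(files)

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

files = horizontal_partition(triples)
for name, records in files.items():
    print(name, len(records))
```

A star query on one subject then reads a single file, while a query that fixes only the predicate has to touch every file unless an index narrows the list.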

23

Vertical Partitioning with JSON

For example

Store in HDFS

File 1 File name: isCalled

[{s:Albert Einstein, o:Albert Einstein},
 {s:Albert Einstein, o:阿尔伯特•爱因斯坦}]

File 2 File name: wasBornIn

[{s:Albert Einstein, o:Ulm},
 {s:Charles K. Kao, o:Shanghai},
 {s:Faye Wong, o:Beijing}]

File 3 File name: wasBornOnDate

[{s:Albert Einstein, o:1879-03-14}]

File 4 File name: hasWonPrize

[{s:Albert Einstein, o:Nobel Prize in Physics},
 {s:Charles K. Kao, o:Nobel Prize in Physics},
 {s:Faye Wong, o:MTV Video Music Awards}]

File 5 File name: diedOnDate

[{s:Albert Einstein, o:1955-04-18}]
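Vertical partitioning amounts to grouping triples by predicate and dropping the predicate from each record, since the file name already carries it. A minimal sketch, with the function name `vertical_partition` chosen for illustration:

```python
from collections import defaultdict

def vertical_partition(triples):
    """One file per predicate; each record keeps only subject and object."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p].append({"s": s, "o": o})
    return dict(files)

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

files = vertical_partition(triples)
for name, records in files.items():
    print(name, records)
```

A triple pattern with a known predicate reads exactly one file under this layout.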


24

Clustered property partitioning with JSON

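For example

Store in HDFS

File 1 File name: cluster1

[{s:Albert Einstein, p:isCalled, o:Albert Einstein},
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦},
 {s:Albert Einstein, p:wasBornIn, o:Ulm},
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14},
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18}]

File 2 File name: cluster2

[{s:Charles K. Kao, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Charles K. Kao, p:wasBornIn, o:Shanghai},
 {s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards},
 {s:Faye Wong, p:wasBornIn, o:Beijing}]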
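The slides do not spell out how clusters are formed. The sketch below assumes the simplest possible rule, that subjects with exactly the same set of properties share a cluster; the real clustering criterion may well be looser. The function name `clustered_property_partition` and the `clusterN` naming scheme are illustrative.

```python
from collections import defaultdict

def clustered_property_partition(triples):
    """Group subjects by their property set (assumed rule) and store each group in one file."""
    properties_of = defaultdict(set)
    for s, p, o in triples:
        properties_of[s].add(p)

    cluster_of_signature = {}   # property set -> cluster file name
    subject_cluster = {}        # subject -> cluster file name
    files = defaultdict(list)
    for s, p, o in triples:
        signature = frozenset(properties_of[s])
        if signature not in cluster_of_signature:
            cluster_of_signature[signature] = "cluster%d" % (len(cluster_of_signature) + 1)
        name = cluster_of_signature[signature]
        subject_cluster[s] = name
        files[name].append({"s": s, "p": p, "o": o})
    return dict(files), subject_cluster

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

files, subject_cluster = clustered_property_partition(triples)
print(subject_cluster)
```

On the sample data, Charles K. Kao and Faye Wong both have only hasWonPrize and wasBornIn, so they land in the same cluster.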

25

Partition Index: Vertical Partitioning

Inverted index

s | File list
Albert Einstein | isCalled, wasBornIn, wasBornOnDate, hasWonPrize, diedOnDate
…

Inverted index

o | File list
Albert Einstein | isCalled
…

File 1 File name: isCalled

[{s:Albert Einstein, o:Albert Einstein},
 {s:Albert Einstein, o:阿尔伯特•爱因斯坦}]

File 2 File name: wasBornIn

[{s:Albert Einstein, o:Ulm},
 {s:Charles K. Kao, o:Shanghai},
 {s:Faye Wong, o:Beijing}]

File 3 File name: wasBornOnDate

[{s:Albert Einstein, o:1879-03-14}]

File 4 File name: hasWonPrize

[{s:Albert Einstein, o:Nobel Prize in Physics},
 {s:Charles K. Kao, o:Nobel Prize in Physics},
 {s:Faye Wong, o:MTV Video Music Awards}]

File 5 File name: diedOnDate

[{s:Albert Einstein, o:1955-04-18}]
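The two inverted indexes for the vertically partitioned files can be built in one pass over the files. A sketch under the assumption that the index simply maps each subject or object value to the list of file names containing it; `build_inverted_indexes` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_indexes(files):
    """Map every subject and every object to the files in which it occurs."""
    index_s = defaultdict(list)
    index_o = defaultdict(list)
    for file_name, records in files.items():
        for record in records:
            if file_name not in index_s[record["s"]]:
                index_s[record["s"]].append(file_name)
            if file_name not in index_o[record["o"]]:
                index_o[record["o"]].append(file_name)
    return dict(index_s), dict(index_o)

# Vertically partitioned files: file name is the predicate.
files = {
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
    "wasBornOnDate": [{"s": "Albert Einstein", "o": "1879-03-14"}],
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "diedOnDate": [{"s": "Albert Einstein", "o": "1955-04-18"}],
}

index_s, index_o = build_inverted_indexes(files)
print(index_s["Albert Einstein"])
print(index_o["Albert Einstein"])
```

No predicate index is needed for this layout, because the file name is already the predicate.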

26

Partition Index: Horizontal partitioning

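Inverted index

p | File list
isCalled | Hash(Subject1)
…

Inverted index

o | File list
Nobel Prize in Physics | Hash(Subject1), Hash(Subject2)
…

File name: Hash(Subject1)

[{s:Albert Einstein, p:isCalled, o:Albert Einstein},
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦},
 {s:Albert Einstein, p:wasBornIn, o:Ulm},
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14},
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18}]

File name: Hash(Subject2)

[{s:Charles K. Kao, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Charles K. Kao, p:wasBornIn, o:Shanghai}]

File name: Hash(Subject3)

[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards},
 {s:Faye Wong, p:wasBornIn, o:Beijing}]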

27

Partition Index: Clustered property partitioning




File name: cluster2

[{s:Charles K. Kao, p:hasWonPrize, o:Nobel Prize in Physics},
 {s:Charles K. Kao, p:wasBornIn, o:Shanghai},
 {s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards},
 {s:Faye Wong, p:wasBornIn, o:Beijing}]

Inverted index

p | File list
isCalled | cluster1
…

Inverted index

o | File list
Albert Einstein | cluster1
…

Inverted index

s | File list
Albert Einstein | cluster1
Charles K. Kao | cluster2
Faye Wong | cluster2
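One plausible use of these indexes at query time is to intersect the file lists of all bound positions in a triple pattern, so that only the remaining files are read. The slides do not describe the lookup procedure, so the function `files_for_pattern` and the index contents below (filled in from the sample triples) are assumptions for illustration.

```python
def files_for_pattern(pattern, indexes, all_files):
    """Return the files that may hold matches for one triple pattern.

    pattern maps "s", "p", "o" to a constant, or to None for a variable.
    """
    candidates = set(all_files)
    for position in ("s", "p", "o"):
        value = pattern.get(position)
        if value is not None and position in indexes:
            candidates &= set(indexes[position].get(value, []))
    return sorted(candidates)

all_files = ["cluster1", "cluster2"]
indexes = {
    "s": {
        "Albert Einstein": ["cluster1"],
        "Charles K. Kao": ["cluster2"],
        "Faye Wong": ["cluster2"],
    },
    "p": {
        "isCalled": ["cluster1"],
        "wasBornOnDate": ["cluster1"],
        "diedOnDate": ["cluster1"],
        "wasBornIn": ["cluster1", "cluster2"],
        "hasWonPrize": ["cluster1", "cluster2"],
    },
    "o": {
        "Albert Einstein": ["cluster1"],
        "Nobel Prize in Physics": ["cluster1", "cluster2"],
        "MTV Video Music Awards": ["cluster2"],
    },
}

# ?who isCalled ?name: only cluster1 has to be read.
only_names = files_for_pattern({"s": None, "p": "isCalled", "o": None}, indexes, all_files)
# ?who hasWonPrize "MTV Video Music Awards": the object index prunes cluster1.
mtv = files_for_pattern({"s": None, "p": "hasWonPrize", "o": "MTV Video Music Awards"}, indexes, all_files)
print(only_names, mtv)
```

A pattern with no bound position falls back to reading every file.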


29

Experiments

Dataset: Billion Triples Challenge 2010 (BTC10)
– 3.2B <s, p, o, q> quads, 624 GB
– The resulting dataset has 1,426,823,976 unique triples

Environment
– Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server, 64-bit
– 30 nodes: one node is the master, the others are slaves
– 47 GB memory, 4.3 TB disk space, and 24 processors of Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
– "dfs.replication" is 2
– JAQL 0.5.1
– Java 1.6

30

Experiments

Fig. Distribution of data

31

Experiments

Fig. Time cost of each query


33

Conclusion

Solution for SPARQL queries in MapReduce
– Transform the queries into JAQL operators running on Hadoop

Transformation of SPARQL to JAQL
– Filter, Transform, Join, ...

Data partitioning strategies
– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning

Experimental results
– Clustered property partitioning has the best performance
– Horizontal partitioning has the worst

34

Scalability

RDBMS
– Waits and deadlocks increase nonlinearly with transaction size and concurrency
– Scale-up (vertical scaling): commercial RDBMSes are very expensive
– Schema: structured data

MapReduce
– Linear scaling, high throughput
– Scale-out (horizontal scaling)
– Schema-free: unstructured data

35

RDBMS vs. MapReduce

          | Traditional RDBMS         | MapReduce
Data size | Gigabytes                 | Petabytes
Access    | Interactive and batch     | Batch
Updates   | Read and write many times | Write once, read many times
Structure | Static schema             | Dynamic schema
Integrity | High                      | Low
Scaling   | Nonlinear                 | Linear

Table. RDBMS compared to MapReduce

36

Limits of Hadoop

– The Apache Hadoop MapReduce framework has hit a scalability limit at around 4,000 machines.
– The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance.

The Next Generation of Apache Hadoop MapReduce

Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components:
– ResourceManager
– ApplicationMaster

Requirements
– Reliability
– Availability
– Scalability: beyond 10,000 machines
– Backward (and forward) compatibility
– Evolution: customers control upgrades
– Predictable latency
– Cluster utilization

38

Conclusion

Hadoop (MapReduce)
– Pro.: scalable, high throughput
– Con.: at the expense of latency; no index; no more than 4,000 nodes

SPARQL on Cloud
– Pro.: scalable, high throughput
– Con.: at the expense of latency; complex queries (JAQL join operation)

39

Thanks!

40

SPARQL queries

Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein" . ?X smc:page ?Y . ?X rdf:type smc:Subject . }

Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x . ?x rdfs:label ?y . ?x rdfs:comment ?z . }

Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y . ?who source:bornOnDate ?date1 . ?who source:diedIn ?Z . ?who source:diedOnDate ?date2 . ?who source:hasWonPrize ?prize . }

Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author . ?x purl:hasBooktitle "ISWC 2009" . ?x purl:hasTitle ?title . }

Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name . ?a property:regoin dbsc:Nord-Pas-de-Calais . ?a pos:lat ?lat . ?a pos:long ?long . ?a property:population ?pop . }

41

SPARQL queries

Q6: select ?bn ?b ?p where { ?a property:name ?bn . ?a property:dateOfBirth ?b . ?a property:placeOfBirth ?p . }

Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y . source:Albert_Einstein rdf:type ?type . source:Albert_Einstein source:hasWonPrize ?prize . }

Q8: select ?a ?type ?pub where { ?a rdf:type ?type . ?a semweb:publisher ?pub . ?a semweb:periodical_title "Theory of Computing Systems" . }

Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name "Chevilly" . ?a geo:ontology#inCountry geo:countries#FR . ?a pos:lat ?lat . ?a pos:long ?long . ?a geo:ontology#population ?pop . }

Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l . ?l pos:lat ?lat . ?l pos:long ?long . }

42

SPARQL queries

– Q3 and Q10 are star-join queries with popular predicates and unspecified objects.
– Q1, Q4, Q5, Q6, Q8 and Q9 are also star joins, but with one or more known objects.
– Q2 is a chain query.
– In Q7, the subject is a fixed value rather than a variable.
