1 efficient sparql query processing in mapreduce through data partitioning and indexing nie zhi...
1
Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing
2
Outline
Introduction Related work SPARQL Query Processing in MapReduce Experiments Conclusion
4
RDF
Resource Description Framework: data is expressed as subject-predicate-object (S-P-O) triples.
[Figure: RDF graph in the namespace http://www.mpii.de/yago/resource/. The node "Albert Einstein" has the edges isCalled (to "Albert Einstein" and "阿尔伯特•爱因斯坦"), hasWonPrize (to Nobel Prize in Physics) and wasBornIn (to Ulm). Nodes are subjects (S) and objects (O); edges are predicates (P).]
5
SPARQL Query Language for RDF
Query:
PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where
}
[Figure: the RDF graph for Albert Einstein (isCalled, hasWonPrize, wasBornIn) against which the query is matched.]
Result:
name                 where
Albert Einstein      Ulm
阿尔伯特•爱因斯坦     Ulm
6
RDF knowledge base…
Semantic Web, Web 2.0: extract knowledge from the Web
– YAGO
– DBpedia
– Freebase
– Billion Triple Challenge
– …
7
RDF knowledge base
– 295 data sets
– 31 billion RDF triples
– 504 million RDF links
(September 2011)
8
Challenge and Opportunity
Challenge
– RDF data is growing rapidly; researchers are now working with billions of triples.
– Relational databases have limited scalability.
Opportunity
– Google GFS, MapReduce, BigTable
– Hadoop: an implementation of the MapReduce framework and HDFS
– Successful deployments: Yahoo!, Amazon, Tencent, Baidu, Taobao, ...
We need to build on these recent advances in handling massive-scale Web data on clusters.
9
MapReduce: word count
Input
  file1: the weather is good
  file2: today is good
  file3: good weather is good.
Map output
  Worker 1: (the 1), (weather 1), (is 1), (good 1)
  Worker 2: (today 1), (is 1), (good 1)
  Worker 3: (good 1), (weather 1), (is 1), (good 1)
Reduce input
  Worker 1: (the 1)
  Worker 2: (is 1), (is 1), (is 1)
  Worker 3: (weather 1), (weather 1)
  Worker 4: (today 1)
  Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce output
  Worker 1: (the 1)
  Worker 2: (is 3)
  Worker 3: (weather 2)
  Worker 4: (today 1)
  Worker 5: (good 4)
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(k3, v3)
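The word-count walk-through above can be imitated in a single Python process. This is only a sketch of the Map → shuffle → Reduce data flow, not Hadoop code; the function names (map_phase, shuffle, reduce_phase) are illustrative and do not come from the slides.

```python
from collections import defaultdict

def map_phase(filename, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in one file."""
    return [(word.strip(".").lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word."""
    return [(word, sum(counts))]

files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good.",
}

map_output = [pair for name, text in files.items() for pair in map_phase(name, text)]
reduce_input = shuffle(map_output)
reduce_output = dict(
    pair
    for word, counts in reduce_input.items()
    for pair in reduce_phase(word, counts)
)
print(reduce_output)
```

Each key in reduce_input corresponds to one reduce worker on the slide: all values of one word end up at the same reducer.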
10
Outline
Introduction Related work SPARQL Query Processing in MapReduce Experiments Conclusion
11
Solution 1
Directly map the SPARQL query into a sequence of MapReduce jobs
Pro.
– Scalable
Con.
– A burden on the user in terms of usage and maintenance
– No support for complex queries
– No index
– Does not consider the characteristics of RDF data
12
Solution 2
Map the SPARQL query to Pig, which compiles it to MapReduce jobs
Pro.
– Scalable
– Supports complex queries
Con.
– No index
– Does not consider the characteristics of RDF data
13
Outline
Introduction Related work SPARQL Query Processing in MapReduce Experiments Conclusion
14
Architecture overview
[Figure: architecture stack. Components shown:
– SPARQL Translator: BGP, Union, Filter, Optional
– RDF 2 JSON Loader
– Optimizer
– JAQL Query Language: Transform, Filter, Join, Sort, Group, Built-in Functions
– JSON Data Model
– Map-Reduce Runtime
– HDFS
– Cluster Deployment and Management]
15
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format
It is based on a subset of the JavaScript Programming Language
JSON is built on two structures:
– A collection of name/value (key/value) pairs
– An ordered list of values (array)
16
RDF to JSON
RDF triples:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
JSON format:
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },
 {s:Albert Einstein, p:wasBornIn, o:Ulm },
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
JSON is built on two structures:
– name/value (key/value) pairs: {s:Albert Einstein}
– list of values (array): [{s:Albert Einstein},{}…]
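The mapping from triples to records can be sketched in a few lines of Python. The helper name triples_to_json is an assumption made for illustration and is not the loader from the slides; also note that strict JSON quotes keys and string values, whereas the slide uses an unquoted shorthand.

```python
import json

def triples_to_json(triples):
    """Turn (subject, predicate, object) tuples into a JSON array of {s, p, o} records."""
    records = [{"s": s, "p": p, "o": o} for s, p, o in triples]
    # ensure_ascii=False keeps non-ASCII literals such as the Chinese name readable.
    return json.dumps(records, ensure_ascii=False, indent=2)

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
]

encoded = triples_to_json(triples)
print(encoded)
```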
17
JAQL
JAQL is an open-source language for querying JSON (JavaScript Object Notation) data.
It provides a general parallel data processing platform on Hadoop
Developed by IBM
18
Basic Idea
SPARQL can be supported on Hadoop by translating queries into JAQL operators
Filter
Transform
Join
Group
Sort
Built-in Function merge (d1, d2), regex(), etc
19
SPARQL to JAQL Transformation
SPARQL Query
PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where .
}
JAQL Query
// read files from HDFS by predicate name
$1 = read(hdfs('source:hasWonPrize'))
     -> filter $.o == "Nobel Prize in Physics"   // select
     -> transform {$.s};                          // project
$2 = read(hdfs('source:isCalled')) -> transform {$.s, $.o};
$3 = read(hdfs('source:wasBornIn')) -> transform {$.s, $.o};
// multi-join
join $1, $2, $3
  where $1.s == $2.s and $2.s == $3.s
  into { name: $2.o, where: $3.o };               // project to ?name ?where
Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein }
[Diagram labels: MapReduce job1, MapReduce job2, MapReduce job3, MapReduce job4]
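To make the operator pipeline concrete, the following sketch replays the same filter, transform and join steps over in-memory lists. It is plain Python, not JAQL and not a MapReduce job; the dictionary `partitions` stands in for the per-predicate HDFS files and its contents are taken from the example triples used elsewhere in the slides.

```python
# One list of {s, o} records per predicate, standing in for the files read from HDFS.
partitions = {
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
}

def run_query(parts):
    # $1: filter on the object (select), then keep only the subject (project).
    winners = [{"s": r["s"]} for r in parts["hasWonPrize"]
               if r["o"] == "Nobel Prize in Physics"]
    # $2 and $3: keep subject and object of each record.
    names = [{"s": r["s"], "o": r["o"]} for r in parts["isCalled"]]
    places = [{"s": r["s"], "o": r["o"]} for r in parts["wasBornIn"]]
    # Multi-way join on the shared subject, projected to ?name and ?where.
    return [
        {"name": n["o"], "where": p["o"]}
        for w in winners
        for n in names if n["s"] == w["s"]
        for p in places if p["s"] == n["s"]
    ]

results = run_query(partitions)
for row in results:
    print(row)
```

Charles K. Kao passes the filter but has no isCalled record in this sample, so the join drops him, which matches the two-row result shown on the SPARQL slide.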
20
Data storage
In the Hadoop framework, a file is the smallest unit of input to a MapReduce job and is read from disk.
One straightforward strategy is to store all the data in one file
– Every read operation must then scan the entire data set
Hence the need for a data partitioning strategy.
21
Data Partitioning Strategy
– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning
22
Horizontal partitioning with JSON
For example, the input triples:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
Stored in HDFS:
File 1 File name: Hash(Subject1)
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: Hash(Subject2)
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai }]
File 3 File name: Hash(Subject3)
[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
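A minimal sketch of horizontal partitioning, assuming the file name is simply a hash of the subject. The slides only say Hash(Subject); the choice of MD5 here, and the names file_name_for and horizontal_partition, are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def file_name_for(subject):
    """Stable file name derived from the subject (assumed hash function: MD5)."""
    return hashlib.md5(subject.encode("utf-8")).hexdigest()

def horizontal_partition(triples):
    """Place every triple in the file named after the hash of its subject."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[file_name_for(s)].append({"s": s, "p": p, "o": o})
    return dict(files)

files = horizontal_partition(TRIPLES)
for name, records in files.items():
    print(name, len(records), "records for", records[0]["s"])
```

Because all triples of one subject land in the same file, a star-shaped query on a known subject needs to read only one file.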
23
Vertical Partitioning with JSON
For example, the same ten input triples as on the horizontal partitioning slide are stored in HDFS as:
File 1 File name: isCalled
[{s:Albert Einstein, o:Albert Einstein },{s:Albert Einstein, o:阿尔伯特•爱因斯坦 }]
File 2 File name: wasBornIn
[{s:Albert Einstein, o:Ulm },{s:Charles K. Kao , o:Shanghai},{s:Faye Wong, o:Beijing}]
File 3 File name: wasBornOnDate
[{s:Albert Einstein, o:1879-03-14 }]
File 4 File name: hasWonPrize
[{s:Albert Einstein, o:Nobel Prize in Physics },{s:Charles K. Kao , o:Nobel Prize in Physics },{s:Faye Wong, o:MTV Video Music Awards }]
File 5 File name: diedOnDate
[{s:Albert Einstein, o:1955-04-18 }]
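Vertical partitioning can be sketched the same way: one file per predicate, with the predicate dropped from the stored records because the file name already carries it. The function name vertical_partition is illustrative, not from the slides.

```python
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def vertical_partition(triples):
    """Group triples by predicate; each file holds only {s, o} records."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p].append({"s": s, "o": o})
    return dict(files)

files = vertical_partition(TRIPLES)
for predicate, records in files.items():
    print(predicate, records)
```

A triple pattern with a known predicate, which is the common case in the example queries, then reads exactly one file.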
24
Clustered property partitioning with JSON
For example, the same ten input triples as on the horizontal partitioning slide are stored in HDFS as:
File 1 File name: cluster1
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: cluster2
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai },{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
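The slides do not spell out how the clusters are formed. One simple rule that reproduces the example above is to put subjects with exactly the same set of predicates into the same cluster; the sketch below uses that rule purely as an assumption, and a real system may cluster by similarity rather than equality.

```python
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def clustered_property_partition(triples):
    """Assumed rule: subjects sharing an identical predicate set share a cluster file."""
    predicates_of = defaultdict(set)
    for s, p, _ in triples:
        predicates_of[s].add(p)

    cluster_of = {}  # predicate set -> cluster file name
    files = defaultdict(list)
    for s, p, o in triples:
        signature = frozenset(predicates_of[s])
        if signature not in cluster_of:
            cluster_of[signature] = f"cluster{len(cluster_of) + 1}"
        files[cluster_of[signature]].append({"s": s, "p": p, "o": o})
    return dict(files)

files = clustered_property_partition(TRIPLES)
for name, records in files.items():
    print(name, sorted({r["s"] for r in records}))
```

Here Charles K. Kao and Faye Wong both have exactly {hasWonPrize, wasBornIn}, so they share cluster2, while Albert Einstein's richer predicate set forms cluster1.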
25
Partition Index: Vertical Partitioning
Inverted index on s
  s                 File list
  Albert Einstein   isCalled, wasBornIn, wasBornOnDate, hasWonPrize, diedOnDate
  ……
Inverted index on o
  o                 File list
  Albert Einstein   isCalled
  ……
File 1 File name: isCalled
[{s:Albert Einstein, o:Albert Einstein },{s:Albert Einstein, o:阿尔伯特•爱因斯坦 }]
File 2 File name: wasBornIn
[{s:Albert Einstein, o:Ulm },{s:Charles K. Kao , o:Shanghai},{s:Faye Wong, o:Beijing}]
File 3 File name: wasBornOnDate
[{s:Albert Einstein, o:1879-03-14 }]
File 4 File name: hasWonPrize
[{s:Albert Einstein, o:Nobel Prize in Physics },{s:Charles K. Kao , o:Nobel Prize in Physics },{s:Faye Wong, o:MTV Video Music Awards }]
File 5 File name: diedOnDate
[{s:Albert Einstein, o:1955-04-18 }]
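Building such an inverted index amounts to recording, for every value of a field, the set of files in which it occurs. The sketch below does this for the vertically partitioned example; build_inverted_index is a hypothetical helper and the in-memory dictionary `files` stands in for the HDFS files.

```python
from collections import defaultdict

# Vertically partitioned example data: file name (predicate) -> {s, o} records.
files = {
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
    "wasBornOnDate": [{"s": "Albert Einstein", "o": "1879-03-14"}],
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "diedOnDate": [{"s": "Albert Einstein", "o": "1955-04-18"}],
}

def build_inverted_index(files, field):
    """Map each value of `field` ("s" or "o") to the sorted list of files containing it."""
    index = defaultdict(set)
    for file_name, records in files.items():
        for record in records:
            index[record[field]].add(file_name)
    return {value: sorted(names) for value, names in index.items()}

s_index = build_inverted_index(files, "s")
o_index = build_inverted_index(files, "o")
print(s_index["Albert Einstein"])
print(o_index["Albert Einstein"])
```

With vertical partitioning the predicate is already the file name, so only the s and o indexes are needed; the other partitioning schemes index the fields that their file names do not encode.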
26
Partition Index: Horizontal partitioning
Inverted index on p
  p                        File list
  isCalled                 Hash(Subject1)
  ……
Inverted index on o
  o                        File list
  Nobel Prize in Physics   Hash(Subject1), Hash(Subject2)
  ……
File 1 File name: Hash(Subject1)
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: Hash(Subject2)
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai }]
File 3 File name: Hash(Subject3)
[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
27
Partition Index: Clustered property partitioning
File 1 File name: cluster1
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: cluster2
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai },{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
Inverted index on p
  p                 File list
  isCalled          cluster1
  ……
Inverted index on o
  o                 File list
  Albert Einstein   cluster1
  ……
Inverted index on s
  s                 File list
  Albert Einstein   cluster1
  Charles K. Kao    cluster2
  Faye Wong         cluster2
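At query time the indexes let the engine skip files that cannot match a triple pattern. A minimal sketch, assuming the file lists of all bound terms are simply intersected; the function files_to_read and the hard-coded index excerpts are illustrative and cover only part of the example data.

```python
# Excerpts of the three inverted indexes for the clustered example (value -> file list).
INDEXES = {
    "s": {
        "Albert Einstein": ["cluster1"],
        "Charles K. Kao": ["cluster2"],
        "Faye Wong": ["cluster2"],
    },
    "p": {
        "isCalled": ["cluster1"],
        "hasWonPrize": ["cluster1", "cluster2"],
        "wasBornIn": ["cluster1", "cluster2"],
    },
    "o": {
        "Albert Einstein": ["cluster1"],
        "Nobel Prize in Physics": ["cluster1", "cluster2"],
        "Shanghai": ["cluster2"],
    },
}
ALL_FILES = ["cluster1", "cluster2"]

def files_to_read(indexes, all_files, s=None, p=None, o=None):
    """Return the files that may contain matches; None means the term is a variable."""
    candidates = set(all_files)
    for field, value in (("s", s), ("p", p), ("o", o)):
        if value is not None:
            # A bound term restricts the candidates to the files listed for it.
            candidates &= set(indexes[field].get(value, ()))
    return sorted(candidates)

print(files_to_read(INDEXES, ALL_FILES, p="isCalled"))
print(files_to_read(INDEXES, ALL_FILES, s="Charles K. Kao", p="hasWonPrize"))
print(files_to_read(INDEXES, ALL_FILES))
```

A pattern with no bound term falls back to scanning every file, which is the case the partition index cannot help with.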
28
Outline
Introduction Related work SPARQL Query Processing in MapReduce Experiments Conclusion
29
Experiments
Dataset: Billion Triples Challenge 2010 (BTC10)
– 3.2 billion <s, p, o, q> quads, 624 GB
– The resulting dataset has 1,426,823,976 unique triples
Cluster
– Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server (64-bit)
– 30 nodes: one node is the master, the others are slaves
– 47 GB memory, 4.3 TB disk space and 24 Intel(R) Xeon(R) E5645 @ 2.40 GHz processors
– "dfs.replication" is 2
– JAQL 0.5.1, Java 1.6
30
Experiments
Fig. Distribution of data
31
Experiments
Fig. Execution time of each query
32
Outline
Introduction Related work SPARQL Query Processing in MapReduce Experiments Conclusion
33
Conclusion
Solution for SPARQL queries in MapReduce
– Transform the queries into JAQL operators running on Hadoop
Transformation of SPARQL to JAQL
– Filter, Transform, Join, ……
Data partitioning strategies
– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning
Experimental results
– Clustered property partitioning performs best
– Horizontal partitioning performs worst
34
Scalability
RDBMS
– Waits and deadlocks increase nonlinearly with transaction size and concurrency
– Scale-up (vertical scaling): commercial RDBMSes are very expensive
– Schema: structured data
MapReduce
– Linear scaling, high throughput
– Scale-out (horizontal scaling)
– Schema-free: unstructured data
35
RDBMS vs. MapReduce
             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Table: RDBMS compared to MapReduce
36
Limits of Hadoop
– The Apache Hadoop MapReduce framework has hit a scalability limit at around 4,000 machines
– The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance
The Next Generation of Apache Hadoop MapReduce
Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components:
– ResourceManager
– ApplicationMaster
– Reliability
– Availability
– Scalability: beyond 10,000 machines
– Backward (and forward) compatibility
– Evolution: for customers to control upgrades
– Predictable latency
– Cluster utilization
38
Conclusion
Hadoop (MapReduce)
– Pro.
  Scalable
  High throughput
– Con.
  At the expense of latency
  No index
  No more than 4,000 nodes
SPARQL on Cloud
– Pro.
  Scalable
  High throughput
– Con.
  At the expense of latency
  Complex queries: JAQL join operation
39
Thanks!
40
SPARQL queries
Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein". ?X smc:page ?Y. ?X rdf:type smc:Subject. }
Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x. ?x rdfs:label ?y. ?x rdfs:comment ?z. }
Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y. ?who source:bornOnDate ?date1. ?who source:diedIn ?Z. ?who source:diedOnDate ?date2. ?who source:hasWonPrize ?prize. }
Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author. ?x purl:hasBooktitle "ISWC 2009". ?x purl:hasTitle ?title. }
Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name. ?a property:region dbsc:Nord-Pas-de-Calais. ?a pos:lat ?lat. ?a pos:long ?long. ?a property:population ?pop. }
41
SPARQL queries (continued)
Q6: select ?bn ?b ?p where { ?a property:name ?bn. ?a property:dateOfBirth ?b. ?a property:placeOfBirth ?p. }
Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y. source:Albert_Einstein rdf:type ?type. source:Albert_Einstein source:hasWonPrize ?prize. }
Q8: select ?a ?type ?pub where { ?a rdf:type ?type. ?a semweb:publisher ?pub. ?a semweb:periodical_title "Theory of Computing Systems". }
Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name "Chevilly". ?a geo:ontology#inCountry geo:countries#FR. ?a pos:lat ?lat. ?a pos:long ?long. ?a geo:ontology#population ?pop. }
Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l. ?l pos:lat ?lat. ?l pos:long ?long. }
42
SPARQL queries: characteristics
– Q3 and Q10 are star-join queries with popular predicates and unspecified objects.
– Q1, Q4, Q5, Q6, Q8 and Q9 are also star joins, but with one or more known objects.
– Q2 is a chain query.
– In Q7 the subject is a fixed value rather than a variable.