
Post on 28-Jan-2018


Page 1: A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases

A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases

PhD Qualifying Exam

Luiz Henrique Zambom Santana

Prof. Dr. Ronaldo dos Santos Mello (advisor)

UFSC/CTC/INE/PPGCC


Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
  ○ Open Issues
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation
● Schedule


Introduction: Motivation
● RDF is currently widespread:
  ○ Best Buy:
    ■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
  ○ Globo.com:
    ■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
  ○ US data.gov:
    ■ https://www.data.gov/developers/semantic-web


Introduction: Motivation (LOD stats)


Introduction: Objectives

This PhD thesis proposal presents Rendezvous, a middleware for storing massive RDF graphs. This middleware includes a novel data partitioning approach, a fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models, and a caching structure that accelerates query response.


Introduction: Contributions
● (i) a mapping of RDF data to the columnar, document, and key/value NoSQL data models (SADALAGE; FOWLER, 2012)
● (ii) a workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload
● (iii) a caching schema based on key/value databases for speeding up the query response time
● (iv) an experimental evaluation that compares the current version of our approach against two baselines (Rainbow (GU; HU; HUANG, 2015) and ScalaRDF (HU et al., )) by considering Redis, Apache Cassandra and MongoDB, the most popular key/value, columnar and document NoSQL databases, respectively
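To make contribution (i) more concrete, the sketch below shows one possible way a small set of triples could be laid out in each of the three data models. It is an illustration only, not the mapping defined by Rendezvous: the document layout (one document per subject, predicates as fields) and the columnar layout (one table per predicate with S and O columns) are guessed from the query examples later in the deck, and the key/value layout, the sample triples, and all function names are assumptions.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); not from the slides.
TRIPLES = [
    ("A", "p1", "B"),
    ("A", "p2", "C"),
    ("B", "p1", "C"),
]


def to_documents(triples):
    """Document model (assumed): one document per subject, predicates as fields."""
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"subject": s})
        doc.setdefault(p, []).append(o)
    return list(docs.values())


def to_column_tables(triples):
    """Columnar model (assumed): one table per predicate holding (S, O) rows."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append({"S": s, "O": o})
    return dict(tables)


def to_key_values(triples):
    """Key/value model (assumed): 'subject|predicate' key mapped to its objects."""
    kv = defaultdict(list)
    for s, p, o in triples:
        kv[f"{s}|{p}"].append(o)
    return dict(kv)
```

Each function returns plain Python structures, so the same triples can be compared side by side before deciding which store a fragment should go to.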


Background: RDF and SPARQL


Background: NoSQL


State of the Art - Non-NoSQL Triplestores

WARP (h-hop replication), YARS, Hexastore (multiple indexes), 4store, SPIDER, RDF-3X, SHARD, SW-Store (vertical partition), SOLID, SPOVC (horizontal partition), and S2X


State of the Art - NoSQL Triplestores


RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.


State of the Art - Categories

● RDF/NoSQL Converters
● Polystores/Multimodel
● In-memory

Rainbow (GU; HU; HUANG, 2015)

AMADA (Aranda-Andújar, 2012)


State of the Art
● BUGIOTTI, F. et al. Invisible glue: scalable self-tuning multi-stores. In: Conference on Innovative Data Systems Research (CIDR). [S.l.: s.n.], 2015.


State of the Art - Open Issues
● To avoid indexing all the triple component permutations
● To consider workload and the usage of statistics for data partitioning
● To exploit in-memory possibilities
● To combine RDF storage with multiple NoSQL models


Rendezvous: Architecture


Rendezvous: Storing
● Fragmentation
● Indexing
● Partitioning
● Mapping


Storing: Fragmentation


Storing: Indexing


Storing: Indexing - Simple queries and fragmentation


Storing: Partitioning
● Partitioning is used when the dataset is larger than the capacity of a single server


Rendezvous: Querying
● Query decomposition
● Caching



Querying: Decomposition

Q1: SELECT ?x WHERE { ?x p5 ?y . ?y p2 ?z . }

Q2: SELECT ?x WHERE { ?x p9 ?y . M p10 ?y . }

D1: db.partition1.find({p5: {$exists: true}, p2: {$exists: true}})

D2: db.partition1.find({p9: {$exists: true}, subject: "M"})
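The translation from Q1/Q2 to D1/D2 can be sketched in a few lines. The rule used here is only inferred from these two examples and is an assumption, not the documented Rendezvous algorithm: a pattern with a variable subject becomes an `$exists` test on its predicate, and a pattern with a constant subject becomes an equality test on the `subject` field. The function and variable names are hypothetical.

```python
def is_var(term):
    """A term is a variable when it starts with '?', as in SPARQL."""
    return term.startswith("?")


def patterns_to_filter(patterns):
    """Build a MongoDB-style filter dict from (subject, predicate, object) patterns.

    Assumed rule, inferred from D1 and D2 only:
      - variable subject -> require the predicate field to exist
      - constant subject -> require the 'subject' field to equal it
    """
    flt = {}
    for s, p, _o in patterns:
        if is_var(s):
            flt[p] = {"$exists": True}
        else:
            flt["subject"] = s
    return flt


# Triple patterns of Q1 and Q2 from the slide.
Q1 = [("?x", "p5", "?y"), ("?y", "p2", "?z")]
Q2 = [("?x", "p9", "?y"), ("M", "p10", "?y")]

D1 = patterns_to_filter(Q1)
D2 = patterns_to_filter(Q2)
```

The resulting dictionaries have the same shape as the arguments of `find` in D1 and D2, so they could be handed to a MongoDB driver as a query filter.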


Q3: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . }

C1: SELECT S1,O1 FROM p1
    SELECT S2,O2 FROM p2 WHERE O=S1
    SELECT S3,O3 FROM p3 WHERE O=S2 AND S=D

Q4: SELECT ?x WHERE { ?x p2 ?y . ?y p3 ?z . ?x p5 ?w . ?w p9 ?k . L p11 ?k . }

SQ5: SELECT ?x WHERE { ?x p2 ?y . ?y p3 ?z . ?x p5 ?w . }

SQ6: SELECT ?x WHERE { ?x p5 ?w . ?w p9 ?k . L p11 ?k . }
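A chain query such as Q3 walks from one predicate table to the next, matching the object of one step with the subject of the following step. The sketch below mimics that walk in memory. It assumes one table per predicate holding (subject, object) pairs, as C1 suggests; the sample data and the name `chain_join` are hypothetical, and a real deployment would issue one lookup per step against the columnar store instead.

```python
def chain_join(tables, predicates):
    """Evaluate a chain ?v0 p1 ?v1 . ?v1 p2 ?v2 ... over per-predicate tables.

    tables maps a predicate name to a list of (subject, object) pairs.
    predicates must contain at least one predicate name.
    Returns one tuple of bindings (v0, v1, ..., vn) per matching path.
    """
    # Start with every row of the first predicate table.
    paths = [(s, o) for s, o in tables.get(predicates[0], [])]
    for pred in predicates[1:]:
        # Index the next table by subject so each extension is a dict lookup.
        by_subject = {}
        for s, o in tables.get(pred, []):
            by_subject.setdefault(s, []).append(o)
        # Extend each path whose last node is a subject in the next table.
        paths = [path + (o,) for path in paths for o in by_subject.get(path[-1], [])]
    return paths


# Hypothetical data: only A reaches the end of the p1 -> p2 -> p3 chain.
TABLES = {
    "p1": [("A", "B"), ("E", "F")],
    "p2": [("B", "C")],
    "p3": [("C", "D"), ("C", "G")],
}

RESULT = chain_join(TABLES, ["p1", "p2", "p3"])
```

The first element of each returned tuple is the binding for ?x in Q3.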


Q5: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?z p5 ?k . ?k p6 G . ?k p7 I . ?k p8 H }

P1: { ?k p6 G . ?k p7 I . ?k p8 H }

P2: { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }

P3: { ?z p5 ?k }
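The three parts of Q5 have different shapes: in P1 all patterns share the subject ?k, in P2 each object feeds the next subject, and P3 is a single pattern linking the two. A small helper can tell these shapes apart. This is a sketch under assumptions: the labels "star", "single", and "mixed" and the function name are hypothetical, and the slides do not say how Rendezvous itself chooses the split.

```python
def classify_shape(patterns):
    """Label a list of (subject, predicate, object) patterns by its join shape."""
    if len(patterns) == 1:
        return "single"
    # Star: every pattern hangs off the same subject.
    if len({s for s, _p, _o in patterns}) == 1:
        return "star"
    # Chain: the object of each pattern is the subject of the next one.
    if all(prev[2] == nxt[0] for prev, nxt in zip(patterns, patterns[1:])):
        return "chain"
    return "mixed"


# The three parts of Q5 from the slide.
P1 = [("?k", "p6", "G"), ("?k", "p7", "I"), ("?k", "p8", "H")]
P2 = [("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w")]
P3 = [("?z", "p5", "?k")]

# The undecomposed query, in the order the slide lists its patterns.
Q5 = P2 + P3 + P1
```

Under this labelling the whole of Q5 is "mixed", which is one way to see why it is split into simpler parts before being sent to the stores.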


Querying: Caching


Querying: Caching (chain queries)


Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins: all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena version 3.2.0 with Java 1.8, Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instance with 7.5 GB of memory and 1 x 32 GB SSD capacity


Evaluation: Rendezvous vs. Rainbow


Evaluation: Rendezvous vs. ScalaRDF


Evaluation: Conclusions

● Fragments are scalable
● Bigger boundaries are not necessarily related to bigger storage size
● Graph-aware partitions are better than NoSQL partitions
● Near cache is fast, but it makes it more difficult to keep data consistent


Evaluation: Future Work

● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets


Schedule

● Middleware development (continuously until 2018)
  ○ Compression
  ○ Graph database
  ○ More complex and abstract workload awareness
● Submission of papers (continuously until 2018)
  ○ Special Interest Group on Management of Data (SIGMOD)
  ○ Very Large Data Bases (VLDB)
  ○ IEEE Transactions on Knowledge and Data Engineering (TKDE)
● Defense of the PhD thesis (2019)


LUBM model