Scaling up Linked Data

Presented by: Marin Dimitrov (Ontotext)

Upload: euclid-project

Post on 11-May-2015


DESCRIPTION

This presentation addresses the main issues of Linked Data and scalability. In particular, it gives details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some Big Data solutions can be transferred to the context of Linked Data.

TRANSCRIPT

Page 1: Scaling up Linked Data

Scaling up Linked Data

Presented by: Marin Dimitrov (Ontotext)

Page 2: Scaling up Linked Data

EUCLID Objective

[Figure: EUCLID architecture diagram. Recoverable labels: Data Acquisition, Streaming providers, Downloads, Musical Content, Other content, Metadata, Physical Wrapper, LD Wrapper, R2R Transf., RDF/XML, Integrated Dataset, Interlinking, Cleansing, Vocabulary Mapping, LD Dataset Access, SPARQL Endpoint, Publishing, RDFa, Application, Visualization Module, Analysis & Mining Module]

EUCLID – Scaling up Linked Data

Page 3: Scaling up Linked Data

Motivation: Music!

• Our aim: build a music-based portal using Linked Data technologies
• So far, we have studied different mechanisms for:
– Linked Data management via SPARQL queries
– Reasoning over Linked Data
– Linked Data access (RDF dumps, endpoints, RDFa)
– Linked Data storage in repositories
• In this chapter, we will study current research and technologies to scale up to very large volumes of Linked Data

[Chapter markers on the slide: CH 2, CH 3, CH 1, CH 5]

Page 4: Scaling up Linked Data

Agenda

1. Introduction to Big (Linked) Data

2. NoSQL databases for Linked Data

3. Hadoop for Linked Data

4. Stream processing for Linked Data

5. … and more

Page 5: Scaling up Linked Data

INTRODUCTION TO BIG (LINKED) DATA

Page 6: Scaling up Linked Data

Introduction to Big Data

Big Data: the management of data that is “too complex” to be processed with traditional solutions

• “Big” does not refer primarily to size; it is an analogy for “overwhelming”
• “Big” can mean “high variety”, “high volume” or “high velocity”

Page 7: Scaling up Linked Data

The 3 Vs of Big Data

• Variety: different forms of data
• Volume: petabytes of data
• Velocity: real-time data streams

Page 8: Scaling up Linked Data

The 3 Vs of Big Data

• Variety
– Data characteristic: structured, semi-structured and unstructured
– Challenge: data integration
– Solution: semantic technologies are a good fit
• Volume
– Data characteristic: large volumes of data
– Challenge: reasoning and querying
– Solution: distributed storage & processing, parallel processing
• Velocity
– Data characteristic: streams, sensors, near real-time data, IoT
– Challenge: reasoning & querying
– Solution: stream reasoning & querying

Page 9: Scaling up Linked Data

The Extended Vs of Big Data

• Veracity: uncertainty of the data
• Variability: variation in meaning in different contexts
• Value: turning data into information into insight
• Not easy to measure
• Depend on context and intended use
• Linked Data & Semantic Technologies can help

Page 10: Scaling up Linked Data

Beyond Big Data

Page 11: Scaling up Linked Data

Beyond Big Data (2)

Semantic Technologies

Semantic technologies extract meaning from data, ranging from quantitative data and text, to video, voice and images. Many of these techniques have existed for years and are based on advanced statistics, data mining, machine learning and knowledge management. One reason they are garnering more interest is the renewed business requirement for monetizing information as a strategic asset. Even more pressing is the technical need. Increasing volumes, variety and velocity (big data) in IM and business operations, requires semantic technology that makes sense out of data for humans, or automates decisions.

Source: Gartner Inc. “Gartner Identifies Top Technology Trends Impacting Information Infrastructure in 2013”

Page 12: Scaling up Linked Data

Towards Big Linked Data

• Variety
– This characteristic is the most inherent to Linked Data
– Agile data model
– Different vocabularies
• Volume
– [Figure: timeline labelled 2007, 2008, 2009, 2010, 2011]
• Velocity
– RDF Streams
– Semantic Sensors

Page 13: Scaling up Linked Data

Towards Big Linked Data (2)

Page 14: Scaling up Linked Data

Big Linked Data & Linked Big Data

Big Linked Data
• Exponential growth of Linked Data in the last five years
• Big Data approach adopted by the Linked Data community, especially to handle Volume and Velocity

Linked Big Data
• Linked Data approach adopted by the Big Data community
• RDF data model for Variety
• Enrich Big Data with metadata and semantics
• Interlink Big Data sets & reduce duplication
• Simplify data access, discovery & integration

Source: M. Dimitrov. “Semantic Technologies for Big Data”

Page 15: Scaling up Linked Data

NOSQL DATABASES FOR LINKED DATA

Page 16: Scaling up Linked Data

RDF Databases

• Native or RDBMS-based RDF databases
– OWLIM (http://www.ontotext.com/owlim)
– Virtuoso Universal Server (http://virtuoso.openlinksw.com/)
– Stardog (http://stardog.com)
– AllegroGraph (http://www.franz.com/agraph/allegrograph/)
– Systap Bigdata (http://www.systap.com/)
– Jena TDB (http://jena.apache.org/documentation/tdb/)
– Oracle, DB2

Page 17: Scaling up Linked Data

RDF Database Advantages

• RDF (graph) based data model
– Global identifiers of resources/entities
– Agile schema
• Inference of implicit facts
– Forward, backward, hybrid reasoning strategy
• Expressive query language (SPARQL)
• Compliance with standards

Page 18: Scaling up Linked Data

NoSQL Databases

• “Not Only SQL”
• A group of database technologies that do not follow the relational data model
• Typical requirements
– Distributed
– High availability
– Handle big data & query volumes (scalability)
– Hierarchical or graph data structures
– Flexible schema

Page 19: Scaling up Linked Data

NoSQL Taxonomy

• Key/value stores
– Each key is associated with a value (DHT)
• Wide-column stores
– Each key is associated with many attributes, columns are stored together
• Document databases
– Each key is associated with a complex data structure
• Graph databases
– Data is represented as nodes and edges

[Figure: conceptual structures of the four store types: key → value; key → structured document; data nodes joined by a relationship; a table with columns Artist, Album, Song and rows (The Beatles, Let it be, Get back) and (Queen, Jazz, Fun it)]

Page 20: Scaling up Linked Data

Key/Value Stores

• Efficient key/value lookups
• Schema-less
• Simpler read/write operations
– Low latency & high throughput
• Examples
– DynamoDB, Azure Table Storage, Riak, Redis, MemcacheDB, Voldemort

Page 21: Scaling up Linked Data

Wide-Column Stores

• A key is associated with several attributes
• Data in the same column is stored together
• Efficient for complex aggregations over data
• Schema-less / dynamic schema
• Easy to add new columns
• Columns can be grouped together (column family)
• Examples:
– HBase (http://hbase.apache.org)
– Cassandra (http://cassandra.apache.org)

[Example table: columns Artist, Album, Song; rows (The Beatles, Let it be, Get back) and (Queen, Jazz, Fun it)]
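The wide-column idea can be sketched in a few lines of Python, using the Artist/Album/Song example. This is a minimal illustration only: the class `WideColumnTable` and its methods are hypothetical and are not the API of HBase or Cassandra. Keeping one dictionary per column stands in for "data in the same column is stored together".

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: one dictionary per column, keyed by row key."""

    def __init__(self):
        # column name -> {row key -> value}; values of one column live together
        self.columns = defaultdict(dict)

    def put(self, row_key, column, value):
        # Writing to an unseen column creates it on the fly (dynamic schema)
        self.columns[column][row_key] = value

    def get_row(self, row_key):
        # Reassemble a row by probing every column for the key
        return {name: cells[row_key]
                for name, cells in self.columns.items() if row_key in cells}

    def scan_column(self, column):
        # An aggregation over one column reads only that column's data
        return list(self.columns[column].values())


table = WideColumnTable()
table.put("The Beatles", "Album", "Let it be")
table.put("The Beatles", "Song", "Get back")
table.put("Queen", "Album", "Jazz")
table.put("Queen", "Song", "Fun it")
table.put("The Beatles", "Origin", "Liverpool")  # new column, no schema change
```

Rows do not need to share the same set of columns: here only The Beatles has an Origin value, and the Queen row is unaffected.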

Page 22: Scaling up Linked Data

HBase

• Open source column-oriented store
• Based on Google’s BigTable
• Built on top of HDFS and Hadoop
• Horizontally scalable, automatic sharding
• High availability / automatic failover
• Strongly consistent reads/writes
• Java/REST API

Page 23: Scaling up Linked Data

Document Databases

• Each key is associated with a complex data structure (document)
• Documents can contain key/value pairs, key/array pairs, or even nested structures
• Schema-less / dynamic schema
– New fields can be easily added to the document structure
• Typical document formats
– JSON, XML
• Examples:
– Couchbase (http://www.couchbase.com)
– MongoDB (http://www.mongodb.org)

Page 24: Scaling up Linked Data

Document Databases (2)

Example: two documents, each stored under the artist name as key

Key: The Beatles
{
  "Homepage": "thebeatles.com",
  "Origin": "Liverpool",
  "Albums": [
    {"Title": "Let it be", "Year": "1970", "Duration": "35:16"},
    {"Title": "Help!", "Year": "1965"},
    {"Title": "Revolver", "Year": "1966", "Duration": "35:01"}
  ]
}

Key: Elvis Presley
{
  "FullName": "Elvis Aaron Presley",
  "Homepage": "elvis.com",
  "Origin": "Memphis",
  "Albums": [
    {"Title": "Blue Hawaii", "Year": "1961", "Duration": "32:02"}
  ]
}
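The schema-less nature of these documents can be illustrated with plain Python dictionaries. This is a sketch only: the `documents` mapping and the helper `albums_without` are hypothetical names, and no real document database API is used.

```python
import json

# Documents keyed by artist name; the two documents have different fields
documents = {
    "The Beatles": {
        "Homepage": "thebeatles.com",
        "Origin": "Liverpool",
        "Albums": [
            {"Title": "Let it be", "Year": "1970", "Duration": "35:16"},
            {"Title": "Help!", "Year": "1965"},
            {"Title": "Revolver", "Year": "1966", "Duration": "35:01"},
        ],
    },
    "Elvis Presley": {
        "FullName": "Elvis Aaron Presley",
        "Homepage": "elvis.com",
        "Origin": "Memphis",
        "Albums": [
            {"Title": "Blue Hawaii", "Year": "1961", "Duration": "32:02"},
        ],
    },
}


def albums_without(field):
    """Return (artist, title) pairs for albums that lack the given field."""
    return [(artist, album["Title"])
            for artist, doc in documents.items()
            for album in doc.get("Albums", [])
            if field not in album]


# Round-trip through JSON, a typical storage format for document databases
restored = json.loads(json.dumps(documents))
```

Because no schema is enforced, a query has to tolerate missing fields, which is why the helper checks for the presence of a field instead of assuming it.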

Page 25: Scaling up Linked Data

Couchbase

• Document-oriented database
– Documents are stored as JSON
• Flexible schema
– Document structure easy to change
• Optimised to run in-memory and on several nodes
– Ejection and eventual persistence
• Incremental views & indexes
• Scalability, rebalancing, replication, failover
• RESTful API

Page 26: Scaling up Linked Data

Graph Databases

Motivation

Graphs: representation of highly connected data

[Figures:]
• Network of friends in a high school
• Relationships among artists in Last.fm (http://sixdegrees.hu/last.fm/)
• A fragment of Facebook
• Relationships between tweets

Page 27: Scaling up Linked Data

Graph Databases

• Based on the property graph model
• Support for query languages and core graph-based tasks
– Reachability, traversal, adjacency and pattern matching
• Examples
– Neo4j (http://neo4j.org)
– Dex (http://sparsity-technologies.com/dex.php)
– HyperGraphDB (http://www.hypergraphdb.org)

Page 28: Scaling up Linked Data

Graph Databases

Example: Property Graph Model

• Nodes and edges may have properties
• Properties: key-value pairs

[Figure: property graph with the following content]
• Nodes
– The Beatles (Homepage: thebeatles.com, Origin: Liverpool)
– Let it be (Year: 1970, Duration: 35:16)
– Help! (Year: 1965)
– Revolver (Year: 1966, Duration: 35:01)
– Elvis Presley (Fullname: Elvis Aaron Presley, Homepage: elvis.com, Origin: Memphis)
– A second album node, labelled “Revolver” on the slide (Year: 1961, Duration: 32:02)
• Edges
– The Beatles “created” Let it be, Help! and Revolver
– Elvis Presley “created” the second album node
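A property graph like the one on this slide can be approximated with two small containers. The sketch below is an assumption-laden illustration: `PropertyGraph` and its methods are hypothetical and far simpler than what Neo4j or similar systems offer.

```python
class PropertyGraph:
    """Toy property graph: nodes and edges both carry key-value properties."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = []   # (source id, label, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        self.edges.append((source, label, target, props))

    def neighbours(self, source, label):
        # Adjacency: follow outgoing edges with the given label
        return [t for s, l, t, _ in self.edges if s == source and l == label]


g = PropertyGraph()
g.add_node("The Beatles", Homepage="thebeatles.com", Origin="Liverpool")
g.add_node("Let it be", Year=1970, Duration="35:16")
g.add_node("Help!", Year=1965)
g.add_node("Revolver", Year=1966, Duration="35:01")
for album in ("Let it be", "Help!", "Revolver"):
    g.add_edge("The Beatles", "created", album)

# Traversal plus a property filter: albums created before 1970
albums_before_1970 = [a for a in g.neighbours("The Beatles", "created")
                      if g.nodes[a]["Year"] < 1970]
```

The traversal touches only the edges leaving one node, which is the kind of locality that graph databases are designed to exploit.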

Page 29: Scaling up Linked Data

Neo4j

• Graph database
– Nodes, Relationships, Properties, Paths
– Indexes over properties
• Flexible schema
• Cypher graph query language
• ACID transactions
• High availability, distributed clusters
• RESTful and Java APIs

Page 30: Scaling up Linked Data

Rya

• RDF store based on Accumulo
– Column-store, HDFS
– Sesame query parser, SAIL implementation
• 3-table index
– SPO, POS, OSP
– Sufficient for all triple patterns
– All triple parts (S, P, O) encoded in the RowID
– Clustered index

Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”
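Why three tables are sufficient for all triple patterns can be seen in a small sketch. The code below is not Rya or Accumulo code: `TripleIndex`, the separator `SEP` and the in-memory sorted lists are assumptions standing in for sorted tables with range scans. Every pattern with some bound positions finds a table in which those positions form a row-ID prefix.

```python
from bisect import bisect_left, insort

SEP = "\x00"  # assumed separator; terms must not contain it


class TripleIndex:
    """Toy 3-table index: each table is a sorted list of row IDs."""

    # Table name -> positions of (s, p, o) in row-ID order
    ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

    def __init__(self):
        self.tables = {name: [] for name in self.ORDERS}

    def add(self, s, p, o):
        triple = (s, p, o)
        for name, order in self.ORDERS.items():
            row_id = SEP.join(triple[i] for i in order)
            insort(self.tables[name], row_id)  # keep the table sorted

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern (None marks a variable) with one range scan."""
        pattern = (s, p, o)
        bound = sum(part is not None for part in pattern)
        # Choose the table whose leading row-ID components are the bound ones
        name, order = next(
            (n, positions) for n, positions in self.ORDERS.items()
            if all(pattern[i] is not None for i in positions[:bound])
        )
        prefix = SEP.join(pattern[i] for i in order[:bound])
        if 0 < bound < 3:
            prefix += SEP  # stop the term "a" from matching the term "ab"
        table = self.tables[name]
        results = []
        i = bisect_left(table, prefix)  # first row ID >= prefix
        while i < len(table) and table[i].startswith(prefix):
            if bound < 3 or table[i] == prefix:
                triple = [None, None, None]
                for position, part in zip(order, table[i].split(SEP)):
                    triple[position] = part  # undo the permutation
                results.append(tuple(triple))
            i += 1
        return results


index = TripleIndex()
index.add("beatles", "created", "let_it_be")
index.add("beatles", "created", "help")
index.add("beatles", "origin", "liverpool")
index.add("elvis", "created", "blue_hawaii")
```

For example, a pattern with S and O bound is served by the OSP table, because O followed by S is a prefix of its row IDs.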

Page 31: Scaling up Linked Data

Rya (2)

• Query processing
– Sesame (SPARQL) query plan translated to Accumulo range scans & lookups
– Parallel scans for joins (10-20x speedup)
– Batch scans (Accumulo) to reduce the number of range scans
– Statistics for triple pattern selectivity, query re-ordering
• Performance evaluation (LUBM)
– No significant degradation when the data grows by 2-3 orders of magnitude

Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”

Page 32: Scaling up Linked Data

“NoSQL Databases for RDF: An Empirical Evaluation”

• Goal
– Store RDF data in HBase, Couchbase, Hive & Cassandra
– Benchmark query performance against a native distributed RDF database (4store)
• HBase prototype
– Jena for SPARQL queries
– 3 index tables (SPO, POS, OSP)
– Row key encodes S+P+O, cells are empty
– Jena query plan translated to HBase filters & lookups

Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”

Page 33: Scaling up Linked Data

“NoSQL Databases for RDF: An Empirical Evaluation” (2)

• Hive+HBase prototype
– SPARQL to HiveQL translation
– Property table
• Row key is S
• A column for each P
• Cell value stores O
• Multi-valued attributes have different timestamps

Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
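The property-table layout can be sketched in memory as follows. This is an illustration under assumptions, not the prototype's code: `PropertyTable` is a hypothetical class, and an integer counter stands in for the cell timestamps that distinguish the values of a multi-valued attribute.

```python
from collections import defaultdict
from itertools import count


class PropertyTable:
    """Toy property table: row key S, one column per P, cell value O."""

    def __init__(self):
        # S -> P -> {timestamp: O}; several timestamps model several values
        self.rows = defaultdict(lambda: defaultdict(dict))
        self._clock = count(1)  # stand-in for real cell timestamps

    def add(self, s, p, o):
        self.rows[s][p][next(self._clock)] = o

    def objects(self, s, p):
        # Return every version of the cell, oldest first
        versions = self.rows.get(s, {}).get(p, {})
        return [versions[ts] for ts in sorted(versions)]


pt = PropertyTable()
pt.add("ex:TheBeatles", "ex:origin", "Liverpool")
pt.add("ex:TheBeatles", "ex:created", "ex:Help")
pt.add("ex:TheBeatles", "ex:created", "ex:Revolver")  # second value, new timestamp
```

The `ex:` identifiers are made up for the example. All triples of one subject end up in a single row, so a subject-centred lookup needs only one row read.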

Page 34: Scaling up Linked Data

“NoSQL Databases for RDF: An Empirical Evaluation” (3)

• CumulusRDF prototype
– Sesame for SPARQL queries, Cassandra for data management
– 3 index tables (SPO, POS, OSP)
– Sesame query plan translated to Cassandra index lookups
• Couchbase prototype
– Map RDF into JSON documents
• All triples with the same S stored in the same document (molecule)
• 2 JSON arrays for Ps and Os
– Jena as a SPARQL query engine
– 3 indexes (Couchbase views): SPO, POS, OSP

Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
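The mapping of triples into one JSON document per subject can be sketched like this. The field names `"p"` and `"o"` and both helper functions are assumptions made for illustration; the slide only states that two JSON arrays hold the predicates and objects.

```python
import json
from collections import defaultdict


def to_molecules(triples):
    """Group triples by subject into documents with two parallel arrays."""
    molecules = defaultdict(lambda: {"p": [], "o": []})
    for s, p, o in triples:
        molecules[s]["p"].append(p)
        molecules[s]["o"].append(o)
    return dict(molecules)


def from_molecule(subject, doc):
    """Rebuild the triples of one subject from its document."""
    return [(subject, p, o) for p, o in zip(doc["p"], doc["o"])]


triples = [
    ("ex:TheBeatles", "ex:origin", "Liverpool"),
    ("ex:TheBeatles", "ex:created", "ex:Revolver"),
    ("ex:ElvisPresley", "ex:origin", "Memphis"),
]
molecules = to_molecules(triples)
as_json = json.dumps(molecules["ex:TheBeatles"])  # what would be stored
```

Position i of the predicate array pairs with position i of the object array, so the original triples of a subject can be recovered from its document.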

Page 35: Scaling up Linked Data

“NoSQL Databases for RDF: An Empirical Evaluation” (4)

• Benchmarks
– BSBM 10M, 100M and 1B triples
– 1, 2, 4, 8, 16 node cluster
– AWS cost & query execution time

Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”

Page 36: Scaling up Linked Data

“NoSQL Databases for RDF: An Empirical Evaluation” (5)

• Results
– Simple SPARQL queries can be executed more efficiently on a NoSQL datastore
– Data loading time for some NoSQL datastores is comparable to or better than that of the native RDF store
– Complex SPARQL queries perform significantly slower on NoSQL systems
• Query optimisations are required
– MapReduce operations (Hive & Couchbase) introduce high latency for view maintenance / query execution

Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”

Page 37: Scaling up Linked Data

HADOOP FOR LINKED DATA

Page 38: Scaling up Linked Data

Working with Distributed Data

• Apache Hadoop (http://hadoop.apache.org) is an open source implementation of MapReduce
• MapReduce
– Distributed batch processing
– Map phase partitions the input set (K/V pairs), Reduce phase performs aggregated processing over the partitions in parallel
– Shuffle intermediate results (from Map nodes to Reduce nodes)
• Allows for the processing of large distributed data sets across clusters of computers
– On a distributed file system (HDFS)
– Scales up to thousands of nodes, each offering local processing power and storage
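The three phases can be mimicked on a single machine to show the data flow. This is a single-process sketch, not Hadoop code: `run_mapreduce` and the two example functions are hypothetical, and the "shuffle" is just a dictionary grouping that, on a cluster, would move data between nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields zero or more (key, value) pairs
    intermediate = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: bring all values with the same key together
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: groups are independent, so they could run in parallel
    return {key: reducer(key, values) for key, values in groups.items()}


def predicate_mapper(triple):
    s, p, o = triple
    yield p, 1  # key on the predicate, count one occurrence


def sum_reducer(key, values):
    return sum(values)


triples = [
    ("ex:TheBeatles", "ex:created", "ex:Revolver"),
    ("ex:TheBeatles", "ex:created", "ex:Help"),
    ("ex:TheBeatles", "ex:origin", "Liverpool"),
]
predicate_counts = run_mapreduce(triples, predicate_mapper, sum_reducer)
```

Counting how often each predicate occurs is a typical statistic for RDF data; the `ex:` identifiers are made up for the example.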

Page 39: Scaling up Linked Data

“Scalable Distributed Reasoning with MapReduce”

• Goal
– Utilise Hadoop for large-scale reasoning
• Approach
– Implement each RDFS rule (join) via a Map & Reduce function
– Map outputs the original triple as value, and the join term as key
– Reducer receives all triples needed to perform the join

Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
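As an illustration of the approach, the sketch below encodes one RDFS rule (rdfs9: if C is a subclass of D and x has type C, then x has type D) as a map and a reduce function. It is a simplified single-process reading of the idea, not the authors' implementation; the function names and the `ex:` data are assumptions.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_rdfs9(triple):
    """Emit the original triple as value and the join term as key."""
    s, p, o = triple
    if p == TYPE:
        yield o, triple   # join term: the class of the instance
    elif p == SUBCLASS:
        yield s, triple   # join term: the subclass


def reduce_rdfs9(join_term, triples):
    """All triples sharing a join term arrive together; perform the join."""
    superclasses = [o for s, p, o in triples if p == SUBCLASS]
    instances = [s for s, p, o in triples if p == TYPE]
    for instance in instances:
        for superclass in superclasses:
            yield (instance, TYPE, superclass)


def apply_rule(triples):
    groups = defaultdict(list)          # stands in for the shuffle phase
    for triple in triples:
        for key, value in map_rdfs9(triple):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived.update(reduce_rdfs9(key, values))
    return derived


data = [
    ("ex:Revolver", TYPE, "ex:Album"),
    ("ex:Album", SUBCLASS, "ex:MusicalWork"),
    ("ex:MusicalWork", SUBCLASS, "ex:Work"),
]
derived = apply_rule(data)
```

A single pass derives only the direct consequence; reaching `ex:Work` would need another iteration over the derived triples, which is why the number of iterations matters for performance.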

Page 40: Scaling up Linked Data

“Scalable Distributed Reasoning with MapReduce” (2)

Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”

Page 41: Scaling up Linked Data

“Scalable Distributed Reasoning with MapReduce” (3)

• Challenge
– Too many duplicates (unique to derived triple ratio of 1:50)
• Optimisations
– Replicate schema triples on each node (in memory)
• Needed for each join; usually a small set
– Rule re-ordering
• Which rule may be triggered by another rule?
• Reduce the number of required iterations

Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
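The first optimisation can be sketched as a join done entirely in the map phase against an in-memory copy of the schema, with the reduce phase left to remove duplicates. This is a hedged illustration of the idea for one rule (rdfs9), not the paper's code; all names and the `ex:` data are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def load_schema(schema_triples):
    """Build the small in-memory lookup that every node would hold a copy of."""
    superclasses = {}
    for s, p, o in schema_triples:
        if p == SUBCLASS:
            superclasses.setdefault(s, set()).add(o)
    return superclasses


def map_with_schema(triple, superclasses):
    """Join an instance triple against the local schema copy, no shuffle needed."""
    s, p, o = triple
    if p == TYPE:
        for parent in sorted(superclasses.get(o, ())):
            # Key on the derived triple so that duplicates meet in one reducer
            yield (s, TYPE, parent), None


def reduce_dedup(keyed_pairs):
    """Emit each derived triple once, however often it was produced."""
    return {key for key, _ in keyed_pairs}


schema = load_schema([
    ("ex:Album", SUBCLASS, "ex:MusicalWork"),
    ("ex:Single", SUBCLASS, "ex:MusicalWork"),
])
instances = [
    ("ex:GetBack", TYPE, "ex:Album"),
    ("ex:GetBack", TYPE, "ex:Single"),
]
pairs = [pair for t in instances for pair in map_with_schema(t, schema)]
unique = reduce_dedup(pairs)
```

Both instance triples derive the same conclusion, so two pairs are produced but only one triple survives the reduce step, which is a small-scale version of the duplicate problem named on the slide.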

Page 42: Scaling up Linked Data

“Scalable Distributed Reasoning with MapReduce” (4)

• Results
– Throughput of 4.5M triples / sec on a 16-node cluster
– More than 16 nodes do not improve the performance significantly

Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”

Page 43: Scaling up Linked Data

Lessons Learned from Large-scale Reasoning (J. Urbani)

• 1st Law: Treat schema triples differently
– Replicate on all nodes to minimise subsequent data transfer
• 2nd Law: Data skew dominates data distribution
– No universal partitioning scheme for input data
– Computation tasks moved to the nodes storing the data (data locality)
• 3rd Law: Certain problems only appear at a very large scale
– Proof-of-concept prototypes are often not representative

Source: Jacopo Urbani “Three Laws Learned from Web-scale Reasoning”

Page 44: Scaling up Linked Data

STREAM PROCESSING FOR LINKED DATA

Page 45: Scaling up Linked Data

Streaming Data

• A large amount of new data is constantly being created, or data is being updated at a rapid rate
– Traffic data, sensor networks, social networks, financial markets
• Many data sources create a constant “stream of information”
– Not always practical to store all data and then query it
– Continuous queries over transient data
• More recent data is more important
– Describes the current state of a dynamic system

Page 46: Scaling up Linked Data

Stream Processing

• Streams are observed through windows
• Continuous queries can be registered over the stream
• Continuous queries are iteratively evaluated over the data in the current window
– Can leverage static background knowledge (e.g., schema information)
• Generates a stream of answers

[Figure: a window over the stream along the time axis; a continuous query combines the window contents with background knowledge and produces a stream of answers]
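The interplay of window, continuous query and background knowledge can be sketched as follows. Everything here is illustrative: the function names, the sensor readings and the `BACKGROUND` mapping are made up, and a finite list stands in for an unbounded stream.

```python
def sliding_windows(stream, range_s, step_s):
    """Yield (window_end, items) over a time-ordered list of (timestamp, item)."""
    if not stream:
        return
    end = stream[0][0] + step_s
    last = stream[-1][0]
    while end - step_s <= last:
        start = end - range_s
        # The window covers the half-open interval [start, end)
        yield end, [item for ts, item in stream if start <= ts < end]
        end += step_s


BACKGROUND = {"s1": "stage", "s2": "hall"}  # static knowledge: sensor -> location


def query(window_items):
    """Continuous query: highest reading per location in the current window."""
    best = {}
    for sensor, value in window_items:
        location = BACKGROUND[sensor]
        best[location] = max(value, best.get(location, value))
    return best


stream = [(0, ("s1", 10)), (5, ("s2", 20)), (12, ("s1", 30)), (18, ("s2", 40))]
# Re-evaluate the query for every window: the result is a stream of answers
answers = [(end, query(items)) for end, items in sliding_windows(stream, 10, 10)]
```

With a range equal to the step, the windows do not overlap; a range larger than the step makes consecutive windows share data.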

Page 47: Scaling up Linked Data

Linked Stream Data

• A representation of sensor/stream data following the Linked Data principles
– Sensor data can be enriched with semantics
– Facilitates data discovery and integration of heterogeneous data sources
• Challenges
– RDF triples must be annotated with timestamps
– Extensions to the SPARQL language: windows, continuous queries, streaming operators
– Continuous semantics
– Scalability (Volume)
– High throughput and low latency (Velocity)
– Approximate reasoning

Page 48: Scaling up Linked Data

Querying Streams with SPARQL Extensions

• Queries over streaming data are evaluated by specifying continuous queries
• The results of a continuous query are updated as new data arrives
• Several SPARQL extensions add streaming operators based on CQL (Continuous Query Language)
– C-SPARQL
– SPARQLStream
– EP-SPARQL, CQELS, INSTANS

Page 49: Scaling up Linked Data

C-SPARQL (1)

C-SPARQL is an extension of SPARQL 1.1

1. RDF Streams: sequence of RDF triples annotated with timestamps: <(s,p,o), timestamp>

2. FROM STREAM extension for stream sources and windows

FromStrClause → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[RANGE' Window ']'
Window → LogicalWindow | PhysicalWindow
LogicalWindow → Number TimeUnit WindowOverlap
TimeUnit → 'MSEC' | 'SEC' | 'MIN' | 'HOUR' | 'DAY'
WindowOverlap → 'STEP' Number TimeUnit | 'TUMBLING'
PhysicalWindow → 'TRIPLES' Number

Page 50: Scaling up Linked Data

C-SPARQL (2)

3. Registration
• Creates a continuous query over the data source
• The query output is variable bindings, an RDF graph, or a new stream

Registration → 'REGISTER' ('QUERY'|'STREAM') QName 'AS' Query

Page 51: Scaling up Linked Data

C-SPARQL (3)

Example

Query: retrieve each car registered at a toll gate, together with the district that contains the street where the toll gate is placed.

REGISTER QUERY CarsEnteringInDistricts AS
SELECT DISTINCT ?district ?car
FROM STREAM <www.uc.eu/tollgates.trdf> [RANGE 40 SEC STEP 10 SEC]
WHERE {
  ?toll t:registers ?car .
  ?toll c:placedIn ?street .
  ?district c:contains ?street .
}

Source: Barbieri, Davide Francesco, et al. “Querying RDF Streams with C-SPARQL.” ACM SIGMOD Record 39.1 (2010): 20-26.

Page 52: Scaling up Linked Data

C-SPARQL (4)

Source: M. Balduini et al. “Tutorial on Stream Reasoning for Linked Data (ISWC’2013)”

Page 53: Scaling up Linked Data

SPARQLStream (1)

• Utilizes the same definition of RDF streams as in C-SPARQL: <(s,p,o), timestamp>
• The language is defined as follows:

NamedStream → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[' Window ']'
Window → 'NOW-' Integer TimeUnit [UpperBound] [Slide]
UpperBound → 'TO NOW-' Integer TimeUnit
Slide → 'SLIDE' Integer TimeUnit
TimeUnit → 'MS' | 'S' | 'MINUTES' | 'HOURS' | 'DAY'
Select → 'SELECT' [XStream] [DISTINCT | REDUCED] …
XStream → 'ISTREAM' | 'DSTREAM' | 'RSTREAM'

Source: Jean-Paul Calbimonte and Oscar Corcho. “SPARQLStream: Ontology-based access to data streams.” Tutorial at ISWC 2013
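The three XStream keywords follow the relation-to-stream operators of CQL: ISTREAM emits what entered the result since the previous evaluation, DSTREAM what left it, and RSTREAM the whole current result. A minimal sketch over Python sets is given below; the function `relation_to_stream` and the sample bindings are hypothetical and not part of any SPARQLStream implementation.

```python
def relation_to_stream(previous, current, operator):
    """Turn two consecutive window results into the tuples to emit."""
    previous, current = set(previous), set(current)
    if operator == "RSTREAM":
        return current              # everything in the current result
    if operator == "ISTREAM":
        return current - previous   # tuples that were inserted
    if operator == "DSTREAM":
        return previous - current   # tuples that were deleted
    raise ValueError(f"unknown operator: {operator}")


# Query results (sensor, observation) for two consecutive window evaluations
earlier = {("sensor1", "obs1"), ("sensor2", "obs2")}
later = {("sensor2", "obs2"), ("sensor3", "obs3")}
```

For the two sample results, only the `sensor3` binding is new and only the `sensor1` binding has dropped out of the window.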

Page 54: Scaling up Linked Data

SPARQLStream (2)

Example

Query: retrieve an RSTREAM with the observations captured by all sensors in the last 10 minutes.

PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT RSTREAM ?sensor ?observation
FROM STREAM <www.semsorgrid4env.eu/SensorReadings.srdf> [FROM NOW - 10 MINUTES TO NOW STEP 1 MINUTE]
WHERE {
  ?observation a ssn:Observation ;
               ssn:observedBy ?sensor .
}

Page 55: Scaling up Linked Data

Classification of Existing Systems

Source: M. Balduini et al. “Tutorial on Stream Reasoning for Linked Data (ISWC’2013)”

Page 56: Scaling up Linked Data

W3C Semantic Sensor Networks

• SSN Ontology
– http://www.w3.org/2005/Incubator/ssn/ssnx/ssn
– OWL DL ontology
– Used to semantically describe sensors and sensor networks & data
– Recommendations for applying the ontology for Linked Sensor Data

Page 57: Scaling up Linked Data

W3C Semantic Sensor Networks (2)

• Different perspectives
– Sensor, data/observation, system

Page 58: Scaling up Linked Data

… AND MORE

Page 59: Scaling up Linked Data

A Trillion RDF Triples

• Use case

– Use RDF and Linked Data for the customer management database of a big telecom

– Franz Inc / AllegroGraph

Page 60: Scaling up Linked Data

uRiKA Appliance

• YarcData

• Big Data appliance for graph analytics

– 8K processors, 1TB RAM

– In-memory RDF database

– SPARQL 1.1 support

Page 61: Scaling up Linked Data

RDFS Reasoning on GPUs

• Similar approach to Urbani et al. for large-scale reasoning with Hadoop
– Handle rules with 2 antecedents
– Rule reordering
– Dictionary encoding
• Shared-memory architecture
– Efficient GPU algorithm implementation is challenging

Source: Norman Heino & Jeff Z. Pan. “RDFS Reasoning on Massively Parallel Hardware.” ISWC 2012
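Dictionary encoding replaces each RDF term with a compact integer so that triples become fixed-size tuples, which suits both joins and GPU memory. The sketch below shows the general technique only; the class `Dictionary` and the `ex:` terms are assumptions and do not reflect the encoding used in the cited paper.

```python
class Dictionary:
    """Bidirectional mapping between RDF terms and dense integer IDs."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free ID the first time a term is seen
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id):
        return self.id_to_term[term_id]


dictionary = Dictionary()
triples = [
    ("ex:Revolver", "rdf:type", "ex:Album"),
    ("ex:Help", "rdf:type", "ex:Album"),
]
encoded = [tuple(dictionary.encode(term) for term in t) for t in triples]
decoded = [tuple(dictionary.decode(i) for i in t) for t in encoded]
```

Repeated terms such as `rdf:type` share one ID, so comparing two terms during a join becomes an integer comparison.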

Page 62: Scaling up Linked Data

RDFS Reasoning on GPUs (2)

• Data parallelism
– Apply one rule (thread) on one instance triple, join to a schema triple if possible
– Hundreds / thousands of threads working in parallel
• Challenge
– Duplicate removal
• Benchmark
– 5x speedup of computation
– But memory transfer overhead is significant

Source: Norman Heino & Jeff Z. Pan. “RDFS Reasoning on Massively Parallel Hardware.” ISWC 2012

Page 63: Scaling up Linked Data

Benchmarks

• BSBM v3.1 (April 2013)
– http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/
– Includes benchmarks with up to 150 billion triples
– 750x scale increase over the previous BSBM result (200M triples)
• LDBC
– Industry-neutral, non-profit organisation
– Benchmarks for RDF and graph databases, similar to TPC
– Big data volume, complex queries

Page 64: Scaling up Linked Data

SUMMARY

Page 65: Scaling up Linked Data

Summary

• Linked Data is a good fit for the Variety challenge of Big Data
• Linked Data can simplify the data discovery, data access and data integration challenges of Big Data
• Exponential growth of Linked Data
• Linked Data benchmarks target bigger workloads

Page 66: Scaling up Linked Data

Summary (2)

• Ongoing R&D towards scaling up Linked Data for high data Volume and Velocity

– NoSQL datastores for RDF data management

– Hadoop for scalable RDF reasoning

– GPUs for scalable RDF reasoning

• Adapting Linked Data & SPARQL for streaming data scenarios

Page 67: Scaling up Linked Data

For exercises, quizzes and further material (eBook, course), visit our website:

http://www.euclid-project.eu

Other channels: @euclid_project, euclidproject, euclidproject