Find your way in graph labyrinths

with SQL, SPARQL, and Gremlin

Who are we?

Daniel Camarda
[email protected]
https://github.com/mdread

Alfredo
[email protected]
https://github.com/seralf

It’s all about relations

For example: the Northwind DB ... on a graph

SEE: http://sql2gremlin.com/

Schema? Properties or relations? Joins or edges?

SQL 1. - ER: tables for Entities and Relations

A table is, in practice, very similar to a flat CSV. But:
● It introduces types.
● It can be used to materialize important relations, not only entities, normalizing data (= avoiding duplication).
● It can be fast to access using indexes.
● A logical entity can be physically split into many different tables after normalization.
● Relations are not explicit; they are:
  ○ materialized as properties/tables
  ○ expressed by constraints
  ○ retrieved by joins

ROW -> TUPLE!

SEE: Northwind schema
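A minimal sketch of the three points about relations, using Python's built-in sqlite3 module. The two-table slice of Northwind and the sample rows are assumptions made for illustration only; the real schema has many more tables and columns.

```python
import sqlite3

# Hypothetical two-table slice of Northwind; sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories (
    CategoryID   INTEGER PRIMARY KEY,
    CategoryName TEXT NOT NULL
);
CREATE TABLE Products (
    ProductID   INTEGER PRIMARY KEY,
    ProductName TEXT NOT NULL,
    -- the relation is materialized as a property and expressed by a constraint
    CategoryID  INTEGER REFERENCES Categories(CategoryID)
);
INSERT INTO Categories VALUES (1, 'Beverages'), (2, 'Condiments');
INSERT INTO Products VALUES (1, 'Chai', 1), (2, 'Chang', 1), (3, 'Aniseed Syrup', 2);
""")

# ... and it is retrieved by a join
rows = conn.execute("""
    SELECT P.ProductName, C.CategoryName
    FROM Products AS P
    INNER JOIN Categories AS C ON C.CategoryID = P.CategoryID
    ORDER BY P.ProductID
""").fetchall()

for row in rows:
    print(row)  # each row comes back as a plain Python tuple
```

Nothing in the result says "this product belongs to that category": the link only exists while the join is being evaluated.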

RDF 1. - modeling

But tuples can be more “atomic”, if we think differently.

RDF (Resource Description Framework) introduces a conceptual data modeling approach inspired by several best practices, including the well-known Dublin Core. It plays a role similar to ER schemas (mostly used for relational DBs) or class diagrams (mostly used in software design). RDF is based on describing resources by making statements about them: both data and metadata can be described this way (self-described).

Then we have TUPLEs -> TRIPLEs! (actually QUADs, at least)
subject -> predicate -> object (+ context!)

Thus it is a labeled, directed multigraph: it is well suited to managing ontologies, and it can also be managed more or less as a property graph.
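The TUPLE -> TRIPLE step can be sketched in a few lines of Python. The function name row_to_quads, the "Product/1" subject naming and the sample row are assumptions for illustration; a real RDF mapping would mint proper IRIs for subjects and predicates.

```python
from typing import Dict, List, Tuple

Quad = Tuple[str, str, object, str]  # (subject, predicate, object, context)

def row_to_quads(table: str, key: str, row: Dict[str, object], context: str) -> List[Quad]:
    """Split one row (tuple) into one statement per non-key column."""
    subject = f"{table}/{row[key]}"
    return [
        (subject, column, value, context)
        for column, value in row.items()
        if column != key
    ]

# Illustrative row; the context reuses the graph name from the SPARQL slides.
row = {"ProductID": 1, "ProductName": "Chai", "CategoryID": 1}
quads = row_to_quads("Product", "ProductID", row, "http://northwind/graph")
for quad in quads:
    print(quad)
```

One wide tuple becomes several small statements about the same subject, each of which can be stored, linked or dropped on its own.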

RDF 2. - schema

Did you say schema? What is a schema?

● A schema describes your model
● A schema can define constraints and data types on your model
● A schema provides a good abstraction over the raw data (which would otherwise be handled manually)

What is the best language to describe schemas?
● XML: DTD is not XML, XSD is XML
● DDL is SQL, but dialects, dictionaries and schemas change
● RDF can describe both data and metadata (schema)
  ○ Are we afraid of standards? Why? Are they too complex?
  ○ A schema must be maintained!
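A rough Python sketch of "data and metadata in the same language": schema statements and data statements sit in one set of triples and are read with the same lookup. The nw: names are illustrative, and treating rdfs:domain as a constraint is a deliberate simplification (RDFS itself uses it for inference, not validation).

```python
# Schema statements and data statements live in the same set of triples.
# The nw: names are illustrative, not taken from a published vocabulary.
triples = {
    # metadata (schema)
    ("nw:Product", "rdf:type", "rdfs:Class"),
    ("nw:Category", "rdf:type", "rdfs:Class"),
    ("nw:has_category", "rdfs:domain", "nw:Product"),
    ("nw:has_category", "rdfs:range", "nw:Category"),
    # data
    ("nw:product/1", "rdf:type", "nw:Product"),
    ("nw:category/1", "rdf:type", "nw:Category"),
    ("nw:product/1", "nw:has_category", "nw:category/1"),
}

def objects(subject, predicate):
    """All objects of the statements matching (subject, predicate, ?)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def violations():
    """Statements whose subject is not typed with the predicate's declared domain.

    Simplification: rdfs:domain is used here as a check, not as an inference rule.
    """
    bad = []
    for s, p, o in triples:
        for domain in objects(p, "rdfs:domain"):
            if domain not in objects(s, "rdf:type"):
                bad.append((s, p, o))
    return bad

print(violations())
```

The same objects() lookup answers both "what is the domain of this property?" and "what is the type of this resource?", which is the practical meaning of self-described data.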

RDF 3. - a shared language for schemas

A standardized framework for the description of models: it's only a shared language!
1) No one is forced to adopt a specific vocabulary: only a basic syntax is shared among different domains.
2) However, different domains can be modeled by sharing both schemas and data links, creating a wider knowledge graph.

Examples: all kinds of linked data, and vocabularies such as GoodRelations, schema.org and so on

http://www.google.com/insidesearch/features/search/knowledge.html
https://www.freebase.com/
http://dbpedia.org/


RDF 4. - looking at an RDF vocabulary (schema)

What can one of those RDF vocabularies look like?

For example, the FOAF (Friend Of A Friend) vocabulary, visualized using the VOWL toolkit http://vowl.visualdataweb.org/

SQL & gremlin - 1

SQL
SELECT CategoryName
FROM Categories

Gremlin
g.V('type','category').categoryName

SPARQL
SELECT ?categoryName
WHERE {
  ?uri a nw:Category .
  ?uri nw:categoryName ?categoryName .
}

SQL & gremlin - 2

SQL
SELECT *
FROM Products AS P
INNER JOIN Categories AS C ON (C.CategoryID = P.CategoryID)
WHERE (C.CategoryName = 'Beverages')

SPARQL
SELECT *
FROM <http://northwind/graph>
WHERE {
  ?uri a nw:Product .
  ?uri nw:has_category ?category .
  ?category a nw:Category .
  ?category nw:categoryName 'Beverages' .
}

The same query, with a property path:
SELECT *
FROM <http://northwind/graph>
WHERE {
  ?uri a nw:Product .
  ?uri nw:has_category / nw:categoryName 'Beverages' .
}

Gremlin
g.V('categoryName','Beverages').in('inCategory').map()
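To see why the Gremlin version needs no join condition, here is a rough Python sketch of the same lookup over a tiny in-memory property graph. The vertex ids, the helper names has and in_, and the sample data are assumptions for illustration; this is not how a graph engine is implemented.

```python
# Tiny in-memory property graph; ids, labels and sample data are illustrative.
vertices = {
    "c1": {"type": "category", "categoryName": "Beverages"},
    "c2": {"type": "category", "categoryName": "Condiments"},
    "p1": {"type": "product", "productName": "Chai"},
    "p2": {"type": "product", "productName": "Chang"},
    "p3": {"type": "product", "productName": "Aniseed Syrup"},
}
edges = [  # (out vertex, label, in vertex)
    ("p1", "inCategory", "c1"),
    ("p2", "inCategory", "c1"),
    ("p3", "inCategory", "c2"),
]

def has(key, value):
    """Start step: ids of the vertices whose property `key` equals `value`."""
    return [v for v, props in vertices.items() if props.get(key) == value]

def in_(vertex_ids, label):
    """Follow edges with the given label backwards, from head to tail."""
    return [src for src, lbl, dst in edges if lbl == label and dst in vertex_ids]

beverages = [vertices[v]["productName"]
             for v in in_(has("categoryName", "Beverages"), "inCategory")]
print(beverages)
```

The relation is an explicit edge, so the query only says which edge to walk; there is no CategoryID equality to spell out.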

From table to graph: two strategies

1. RDF mapping, with tools such as R2RML (RDB to RDF Mapping Language) and DM (Direct Mapping)
  a. builds an RDF graph, and an R2RML mapping is itself also RDF (Turtle)
  b. triples can be mapped live from the relational engine, or materialized into a triplestore

2. Build your own graph model.
  a. no need to learn a new language
  b. no need to introduce external tools as dependencies

In both cases, a projection of the graph can be used to produce either a different graph or a table schema.
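A minimal sketch of the second strategy in Python: rows become vertices, foreign keys become edges, and a projection turns one edge label back into a flat table. The function names tables_to_graph and project, the input layout and the sample rows are all assumptions for illustration.

```python
def tables_to_graph(tables, foreign_keys):
    """Build vertices and edges from relational data.

    tables:       {table_name: (primary_key, [row dicts])}
    foreign_keys: [(table, column, target_table, edge_label)]
    """
    vertices, edges = {}, []
    for name, (pk, rows) in tables.items():
        fk_columns = {col for tbl, col, _, _ in foreign_keys if tbl == name}
        for row in rows:
            # foreign-key columns are dropped: they become edges instead
            vertices[(name, row[pk])] = {
                k: v for k, v in row.items() if k not in fk_columns
            }
    for table, column, target, label in foreign_keys:
        pk, rows = tables[table]
        for row in rows:
            if row[column] is not None:
                edges.append(((table, row[pk]), label, (target, row[column])))
    return vertices, edges

def project(edges, label):
    """Project one edge label back to a flat table of (source id, target id)."""
    return [(src[1], dst[1]) for src, lbl, dst in edges if lbl == label]

# Illustrative input data
tables = {
    "Categories": ("CategoryID", [{"CategoryID": 1, "CategoryName": "Beverages"}]),
    "Products": ("ProductID", [
        {"ProductID": 1, "ProductName": "Chai", "CategoryID": 1},
        {"ProductID": 2, "ProductName": "Chang", "CategoryID": 1},
    ]),
}
foreign_keys = [("Products", "CategoryID", "Categories", "inCategory")]

vertices, edges = tables_to_graph(tables, foreign_keys)
print(project(edges, "inCategory"))
```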

Example: GitHub graph

The idea: search for repositories on GitHub, and get information about those repos along with their collaborators and library dependencies.

Why? GitHub has lots of interesting data; analyzing it can give us insights into how the open-source community is evolving. A graph is a natural way to represent this kind of deeply interconnected community.

How does it work? TinkerPop is used on top of OrientDB, which is the backend graph engine. The data is retrieved by a small Scala application.

github schema

Graph visualized

Generated with Gephi https://gephi.org/
● an interactive tool for exploration and analysis of graphs
● connects to external data sources with the Stream plugin
● useful when thinking about your queries

Legend: repository, dependency, user

GitHub data collected on Orient Graph: https://github.com/randomknot/graph-labyrinth-demo


Gremlin is a query language specifically built for graph traversal:
● easy to navigate relationships (edges)
● easy to filter
● start thinking about paths, not records
● Turing-complete language
● default implementation as a Groovy DSL

examples 1

All contributors of a repository

g.v("#11:192").in("contributes").login

Projects to which the contributors of this project also contribute

g.v("#11:192").in("contributes").out("contributes").dedup().name

Repositories with more than ten contributors

g.V("node_type", "Repository").filter{it.inE("contributes").count() > 10}.name
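To make "paths, not records" concrete, the three traversals above can be imitated in plain Python over an edge list. This is only a sketch: the logins, repository names, helper names (in_, out, dedup) and the lowered threshold (more than one contributor instead of ten) are assumptions to keep the sample small.

```python
from collections import defaultdict

# Illustrative data: (contributor login, "contributes", repository name).
edges = [
    ("alice", "contributes", "repo-a"),
    ("bob", "contributes", "repo-a"),
    ("bob", "contributes", "repo-b"),
    ("carol", "contributes", "repo-b"),
    ("carol", "contributes", "repo-c"),
]

out_index = defaultdict(list)  # (vertex, label) -> heads of outgoing edges
in_index = defaultdict(list)   # (vertex, label) -> tails of incoming edges
for src, label, dst in edges:
    out_index[(src, label)].append(dst)
    in_index[(dst, label)].append(src)

def in_(vertices, label):
    return [src for v in vertices for src in in_index[(v, label)]]

def out(vertices, label):
    return [dst for v in vertices for dst in out_index[(v, label)]]

def dedup(items):
    return list(dict.fromkeys(items))  # keeps first-seen order

# All contributors of a repository
contributors = in_(["repo-a"], "contributes")

# Projects to which those contributors also contribute
related = dedup(out(contributors, "contributes"))

# Repositories with more than one contributor (threshold lowered for the sample)
repositories = dedup(dst for _, _, dst in edges)
busy = [r for r in repositories if len(in_index[(r, "contributes")]) > 1]

print(contributors, related, busy)
```

Each step takes a list of vertices and returns the next list, so a query is just a chain of hops.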

examples 2

Common contributors of two projects

g.v('#11:47').in("contributes").as("x").out.retain([g.v('#11:57')]).back("x").login

Users who work on projects using a specific library

g.V("node_type", "Contributor").as("usr").out("contributes").out("depends").filter{it.artifact_id == "spring-social-web"}.back("usr").login
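The as(...) / back(...) pair reads as "remember this vertex, keep walking, and return to it only if the walk survives the filter". A rough Python equivalent, assuming two illustrative adjacency maps (contributes and depends) and made-up logins and repository names:

```python
contributes = {  # contributor login -> repositories (illustrative data)
    "alice": {"repo-a"},
    "bob": {"repo-a", "repo-b"},
    "carol": {"repo-b"},
}
depends = {  # repository -> artifact ids it depends on (illustrative data)
    "repo-a": {"spring-social-web"},
    "repo-b": {"junit"},
}

def common_contributors(repo_x, repo_y):
    """Contributors of repo_x whose onward walk also reaches repo_y."""
    return sorted(user for user, repos in contributes.items()
                  if repo_x in repos and repo_y in repos)

def users_of_library(artifact_id):
    """Users with at least one path user -> repository -> dependency."""
    return sorted(
        user for user, repos in contributes.items()
        if any(artifact_id in depends.get(repo, set()) for repo in repos)
    )

print(common_contributors("repo-a", "repo-b"))
print(users_of_library("spring-social-web"))
```

In both functions the outer loop variable plays the role of the labeled step: the filter looks ahead along the path, but the result is the vertex that was marked.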

How does Gremlin select nodes?

examples 3

five most used libraries

g.V("node_type", "Dependency").inE("depends").inV.groupCount{it.artifact_id}.cap.orderMap(T.decr)[0..4]

contributors of projects with more than ten contributors

g.V("node_type", "Repository").filter{it.inE("contributes").count() > 10}.in("contributes").login
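The "five most used libraries" query is a group-count followed by a descending sort and a slice. A small Python sketch of the same idea, with illustrative edges and a hypothetical most_used helper:

```python
from collections import Counter

# Illustrative "depends" edges: (repository, artifact_id of the dependency).
depends = [
    ("repo-a", "junit"), ("repo-b", "junit"), ("repo-c", "junit"),
    ("repo-a", "slf4j-api"), ("repo-b", "slf4j-api"),
    ("repo-c", "guava"),
]

def most_used(edges, n=5):
    # one count per incoming "depends" edge, grouped by artifact id
    counts = Counter(artifact for _, artifact in edges)
    # highest counts first, keep at most n keys
    return [artifact for artifact, _ in counts.most_common(n)]

print(most_used(depends))
```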

The end

references

● Freebase knowledge base
  https://www.freebase.com/
● Google Knowledge Graph
  http://www.google.com/insidesearch/features/search/knowledge.html
● RDF
  ○ RDF primer
    http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
  ○ VOWL
    http://vowl.visualdataweb.org/
  ○ FOAF - Friend Of A Friend
    http://www.foaf-project.org/
● dbeaver
  http://dbeaver.jkiss.org/








references

● Gremlin documentation
  https://github.com/tinkerpop/gremlin/wiki
  http://gremlindocs.com/
● sql2gremlin
  http://sql2gremlin.com/
  ○ visualization: http://sql2gremlin.com/graph/
  ○ joins: http://sql2gremlin.com/#joining/inner-join
● Gremlin examples
  http://www.fromdev.com/2013/09/Gremlin-Example-Query-Snippets-Graph-DB.html
● SPARQL + Gremlin
  https://github.com/tinkerpop/gremlin/wiki/SPARQL-vs.-Gremlin
● using SPARQL with Gephi to visualize co-authorship
  http://data.linkededucation.org/linkedup/devtalk/?p=31
● mining GitHub followers in TinkerPop (with R, GitHub, Neo4j)
  http://patrick.wagstrom.net/weblog/2012/05/13/mining-github-followers-in-tinkerpop/

