Graph Databases, Triple Stores and their uses…

San Jose, NoSQL, 2012
Jans Aasman
CEO Franz Inc

Overview

• Franz Inc
• What is a Graph Database?
• An example to start
• What is a Triple Store?
• Where do people use Graph Databases and Triple Stores?
– Car manufacturing
– EPIM: a reporting platform for 31 oil companies
– Amdocs
• Why do people use Graph databases?
• How do you get a graph out of your relational database?

Franz Inc – Who We Are

• Private, founded 1984
• An AI and Semantic Technology company
• Berkeley/Oakland

Graph database

What is the difference between a relational database and a graph database?

How is it different and why is it more flexible?

• No Schema
– Say whatever you want to say but
• No Link Tables
– because you can do one-to-many relationships directly
• No Indexing Choices
– Can add new data attributes (predicates) on-the-fly that will be real-time available for querying, because everything is automatically indexed.
• Takes anything you give it: it is trivial to consume
– Rows and columns from RDB, XML, RDF(S), OWL, Text and Extracted Entities
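The "no schema, no link tables, no indexing choices" points can be sketched with a toy in-memory store. This is only an illustration under stated assumptions, not how any particular product is implemented: the class `TinyGraph`, its three indexes and the sample names are all hypothetical. Every triple is indexed on insert, so a predicate that was never declared is queryable immediately, and a one-to-many relationship is simply several triples with the same subject and predicate.

```python
from collections import defaultdict


class TinyGraph:
    """Toy graph store (hypothetical): every triple is indexed three ways on insert."""

    def __init__(self):
        # subject -> predicate -> objects
        self.spo = defaultdict(lambda: defaultdict(set))
        # predicate -> object -> subjects
        self.pos = defaultdict(lambda: defaultdict(set))
        # object -> subject -> predicates
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        """Store one triple; no schema or index declaration is needed first."""
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """All objects for a subject/predicate pair (one-to-many without a link table)."""
        return set(self.spo.get(s, {}).get(p, set()))

    def subjects(self, p, o):
        """All subjects that have predicate p with value o."""
        return set(self.pos.get(p, {}).get(o, set()))


g = TinyGraph()
g.add("jans", "worksFor", "franz")
g.add("jans", "knows", "alice")          # one-to-many: two "knows" links
g.add("jans", "knows", "bob")
g.add("bob", "twitterHandle", "@bob")    # a brand-new predicate, added on the fly
```

A lookup on the new predicate, such as `g.subjects("twitterHandle", "@bob")`, works right after the insert because the predicate index was updated along with the others.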

A triple store is a special type of graph database

• Where nodes and links are (mostly) URIs
– subject, predicate, object, [graph]
– Persistent URIs make it straightforward to link datasets, create web of data (LOD)
• Based on W3C recommendations
– RDFS puts an object layer on top of triples
– OWL adds first order description logic
– SPARQL very close to SQL but focusing on graphs
• Most graph databases live in memory; triple stores can be bigger than memory and rely more on indexing and query optimizers
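To make the "subject, predicate, object, [graph]" shape concrete, here is a minimal sketch in which two datasets are linked only because they use the same URI for the same thing. The namespace `http://example.org/`, the predicate names and the helper functions are assumptions made up for this illustration; a real triple store would answer the same question with indexes and a SPARQL query rather than a list scan.

```python
from typing import List, NamedTuple, Optional

EX = "http://example.org/"  # hypothetical namespace, for illustration only


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # the optional fourth element names the dataset


def match(quads, s=None, p=None, o=None, g=None) -> List[Quad]:
    """Return the quads matching a pattern; None acts as a wildcard."""
    return [
        q for q in quads
        if (s is None or q.subject == s)
        and (p is None or q.predicate == p)
        and (o is None or q.obj == o)
        and (g is None or q.graph == g)
    ]


# Two independently produced datasets that happen to share the URI for Franz.
hr_data = [Quad(EX + "jans", EX + "worksFor", EX + "franz", EX + "graph/hr")]
geo_data = [Quad(EX + "franz", EX + "locatedIn", EX + "oakland", EX + "graph/geo")]
store = hr_data + geo_data


def employer_locations(person: str) -> List[str]:
    """Join across both datasets through the shared employer URI."""
    return [
        loc.obj
        for job in match(store, s=person, p=EX + "worksFor")
        for loc in match(store, s=job.obj, p=EX + "locatedIn")
    ]
```

Calling `employer_locations(EX + "jans")` walks from the HR graph into the geo graph without any mapping step, which is the point of persistent URIs.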

Demo - LOD

Facebook, Bing, Google are all building up big proprietary knowledge graphs

The public version: Linked Open Data

Tim Berners-Lee outlined four principles of linked data:
• Use URIs to identify things.
• Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
• Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
• Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
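The second and third principles amount to an ordinary HTTP GET with content negotiation. The sketch below only builds such a request with Python's standard library and does not send it; the resource URI is a hypothetical placeholder, and whether a given server honours the `Accept` header is an assumption that has to be checked per dataset.

```python
import urllib.request

# Hypothetical resource URI, used here only to show the shape of the request.
RESOURCE_URI = "http://example.org/resource/Berkeley"


def build_dereference_request(uri: str) -> urllib.request.Request:
    """Build a GET request that asks for RDF/XML instead of an HTML page."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def dereference(uri: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body (needs network access)."""
    with urllib.request.urlopen(build_dereference_request(uri), timeout=timeout) as resp:
        return resp.read()


request = build_dereference_request(RESOURCE_URI)
```

The returned document would then be parsed as RDF and its outgoing links followed, which is how the fourth principle improves discovery.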

Oct 2007

LOD cloud – Sept 22 2010

latest LOD cloud

Demo politics

Who uses this in the enterprise?

DoD and Intelligence Customers

Enterprise Experience

• Amdocs: a Telco platform that knows (almost) everything about every customer in real time.
– Saves 20% on the total cost of a Customer Care Operation
• Car manufacturer X
– Warns early for disruptions in the supply chain
• EPIM: a reporting platform for 31 oil companies.
– Create a flexible unified reporting structure over tens of different proprietary reporting sources.

Risk in Supply Chain Management: determine potential impact of any disruptions to the supply chain.

Questions that an Early Warning System should answer:

– Which parts produced by a (sub-sub-)vendor will be less available due to a flood in China?
– Which of our cars will be affected by political unrest in Thailand?
– How can our competitors disrupt our supply chain by buying up all producers of this chip?
– Did one of our sub-sub-vendors start selling to our competition and what does that mean for us?
– What happened historically with the price of this sub-part when the prices for crude oil or any other raw material went up?
– Is one of the (sub-sub-sub-)vendors in our chain in financial distress and how would that affect us?

We need three graphs (or clouds of data) to come together

• The bills of materials for the cars that we produce, plus the parts tree, plus the names of the first tier vendors that provide parts and, if possible, our parts inventory and inventory prediction for parts.
• The supply chain for the first tier vendors
– Who sells the sub-parts to our first tier vendors, and then go recursively down this tree
– Get the names and geo locations and all other metadata about this network of vendors and suppliers
• Spider the web and business news sources for
– Every supplier, the countries where they are located, commodities, etc.
– Analyze the text for risks

The following slide shows the graph as used by power users

• Company X providing an Exhaust Muffler for a car Y
• That is bought from the first tier vendor USAcme
• Who buys it from a vendor in Bangkok (Thai Acme)
• Where we also show a newspaper article that has the news about floods in Thailand
• And because Thailand has the place Bangkok
• We have a potential risk.
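The chain of reasoning in these bullets can be sketched as a small graph traversal. Everything here is a toy: the predicate names (`hasPart`, `boughtFrom`, `buysFrom`, `locatedIn`, `hasPlace`, `reportsFloodIn`) and the identifiers are invented for illustration, and a real system would express this as a graph query over the three combined data clouds rather than as Python loops.

```python
from collections import deque

# Toy triples mirroring the slide; all names are hypothetical.
TRIPLES = [
    ("carY", "hasPart", "exhaustMuffler"),
    ("exhaustMuffler", "boughtFrom", "USAcme"),
    ("USAcme", "buysFrom", "ThaiAcme"),
    ("ThaiAcme", "locatedIn", "Bangkok"),
    ("Thailand", "hasPlace", "Bangkok"),
    ("article1", "reportsFloodIn", "Thailand"),
]

SUPPLY_LINKS = {"hasPart", "boughtFrom", "buysFrom"}


def objects(s, p):
    return [o for (s2, p2, o) in TRIPLES if s2 == s and p2 == p]


def subjects(p, o):
    return [s for (s, p2, o2) in TRIPLES if p2 == p and o2 == o]


def supply_chain(start):
    """Breadth-first walk down parts and vendors, to any depth."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for (s, p, o) in TRIPLES:
            if s == node and p in SUPPLY_LINKS and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


def flood_risks(car):
    """Return (vendor, place, region, article) tuples linking the car to flood news."""
    risks = []
    for node in supply_chain(car):
        for place in objects(node, "locatedIn"):
            # A place is at risk if news mentions it or a region that contains it.
            for region in [place] + subjects("hasPlace", place):
                for article in subjects("reportsFloodIn", region):
                    risks.append((node, place, region, article))
    return sorted(risks)
```

`flood_risks("carY")` connects the car to the news article through USAcme, Thai Acme, Bangkok and Thailand, which is the potential risk the slide describes.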

EPIM ReportingHub Solution
May 11th, 2011

Data Processing Approach

Import → Processing → Export

[Diagram: labels that survive from the figure are the inputs XML, Excel tables and RDB (via D2RQ), each converted to a semantic model; mapping rules and models (SPIN); domain ontologies; an RDF repository; output templates and models (SWP); and the outputs XML, Excel, HTML and JSON.]

Capability Architecture

[Diagram: labels that survive from the figure]
• Capabilities: Reporting, Normalization, Classification, Validation, Report, Dissemination
• Application Ontologies: Reporting Obligations, Normalization Mappings, Classification Models, Validation Rulesets, Report Templates, Dissemination Obligations
• System Ontologies: Access Control, Transforms, Metrics, Workflow, Logging, Notification
• Resource Ontologies: DDR, ISO 15926, NPD FactPages, PCA RDL, Operators, Partners, Policies

When graph database or triple store?

• You have billions of 'same-type' objects and you need to retrieve them extremely fast. Or you need simple analytics.
• You have a fixed size, static data set and you need fast graph computations and pattern matching.
• You need all the features of an enterprise database but
• You need to work with ontology driven knowledge base, rules but also the flexibility of a graph database

When Graph Database or Triple Store?

When you need ultimate flexibility
• Modeling knowledge and assets
• Hundreds to thousands of classes with different features
• Every day new classes and new features
• You work with rules and reasoning

When you need ultimate 'linkability'
• For (ad hoc) integration of databases

When you need pattern recognition and network analysis
• Complex networks of people, companies, products, etc.

When you need event processing using geospatial, temporal reasoning and social network analysis combined with flexible metadata

Q1: A reasonably hard query for horizontally scaling stores and RDB, a straightforward query for vertical/parallel store

Select ?a ?b ?c ?d ?e
where {
Franz send-money ?a
?a send-money ?b
?b send-money ?c
?c send-money Cray
Cray send-money ?d
Not (?d = ?c)
?d send-money ?e
Not (?e = ?b)
?e send-money Franz}
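What the pattern above asks the store to do can be sketched as nested lookups over an adjacency map. This is a toy evaluation under assumptions: the `SENDS` data and the node names A to E are invented, and a real engine would pick a join order with its query optimizer instead of fixed nested loops. Each `for` corresponds to one `send-money` line, and each `continue` to a `Not (...)` filter.

```python
# Hypothetical "send-money" edges: sender -> set of receivers.
SENDS = {
    "Franz": {"A"},
    "A": {"B"},
    "B": {"C", "Franz"},
    "C": {"Cray", "B"},
    "Cray": {"D", "C"},
    "D": {"E"},
    "E": {"Franz"},
}


def q1(sends):
    """Return sorted bindings (a, b, c, d, e) for the seven-step money loop."""
    results = []
    for a in sends.get("Franz", ()):
        for b in sends.get(a, ()):
            for c in sends.get(b, ()):
                if "Cray" not in sends.get(c, ()):
                    continue                      # ?c send-money Cray
                for d in sends.get("Cray", ()):
                    if d == c:
                        continue                  # Not (?d = ?c)
                    for e in sends.get(d, ()):
                        if e == b:
                            continue              # Not (?e = ?b)
                        if "Franz" in sends.get(e, ()):
                            results.append((a, b, c, d, e))
    return sorted(results)
```

Every step is a self-join on the same relation, which is why the query is cheap when all edges sit in one indexed store and expensive when each hop may land on a different machine.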

Q1: A very hard query for horizontally scaling stores and RDB, a straightforward query for vertical/parallel store

Find a money trail from Franz to Cray that is more than two steps, find another money trail from Franz to Cray that is more than two steps, where the two trails are completely different

(Select (?path1 ?path2)
(path Franz Cray <send-money> >= 2 ?path1)
(path Cray Franz <send-money> >= 2 ?path2)
(empty (intersection ?path1 ?path2)))
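A variable-length version of the same search can be sketched as path enumeration. The sketch follows the query as written, so the second trail runs from Cray back to Franz, and it makes three assumptions that are not in the slides: "completely different" is read as sharing no intermediate node, paths are simple (no repeated node), and an arbitrary `max_steps` bound keeps the search finite. The graph data is invented for illustration.

```python
# Hypothetical "send-money" edges: sender -> list of receivers.
GRAPH = {
    "Franz": ["A"],
    "A": ["B"],
    "B": ["C", "Franz"],
    "C": ["Cray"],
    "Cray": ["D", "B"],
    "D": ["E"],
    "E": ["Franz"],
}


def simple_paths(graph, src, dst, min_steps=2, max_steps=6):
    """Yield node lists from src to dst using min_steps..max_steps edges."""
    stack = [[src]]
    while stack:
        path = stack.pop()
        for nxt in graph.get(path[-1], ()):
            if nxt == dst:
                # A path of n nodes plus dst has n edges.
                if len(path) >= min_steps:
                    yield path + [nxt]
            elif nxt not in path and len(path) < max_steps:
                stack.append(path + [nxt])


def disjoint_trails(graph, a, b, min_steps=2):
    """Pairs (a-to-b path, b-to-a path) that share no intermediate node."""
    pairs = []
    for p1 in simple_paths(graph, a, b, min_steps):
        for p2 in simple_paths(graph, b, a, min_steps):
            if not set(p1[1:-1]) & set(p2[1:-1]):
                pairs.append((p1, p2))
    return pairs
```

With this data the return path through B is rejected because B already appears on the outbound trail, leaving only the route through D and E.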

Why is this hard in SQL

• Relational databases are very good at straight joins but less optimal for self-joins of unpredictable length
• Try writing this as a SQL query ☺

Why is this hard in distributed key/value stores

• Databases like Cassandra are extremely good at retrieving nested objects in a sea of billions of objects but are less optimal for joins.
• Relatively hard to write these as map reduce expressions. Every query has to be expressed as a program; ad hoc querying is therefore discouraged.

A Simple Event Ontology

• A type
– Meetings, communications event, financial transactions, visit, attack/truce, an insurance claim, a purchase order
– RDFS++ reasoning
• A list of actors
– Social Network Analysis
• A place
– GeoSpatial Reasoning
• A Start-time and possibly an end-time
– Temporal Reasoning
• Anything else that describes the event
– Goods that changed hands
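One way to picture this ontology is as a record that flattens into triples. The sketch below is an illustration only: the `Event` class, the `ex:` predicate names and the sample meeting are hypothetical and are not the encoding used by any particular store.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    event_id: str
    event_type: str                      # the type, e.g. a meeting
    actors: List[str]                    # the list of actors
    place: Tuple[float, float]           # the place as (latitude, longitude)
    start: datetime                      # the start time
    end: Optional[datetime] = None       # the end time, if known
    extras: Dict[str, str] = field(default_factory=dict)  # anything else

    def to_triples(self):
        """Flatten the event into (subject, predicate, object) triples."""
        triples = [
            (self.event_id, "rdf:type", self.event_type),
            (self.event_id, "ex:place", self.place),
            (self.event_id, "ex:startTime", self.start.isoformat()),
        ]
        triples += [(self.event_id, "ex:actor", a) for a in self.actors]
        if self.end is not None:
            triples.append((self.event_id, "ex:endTime", self.end.isoformat()))
        triples += [(self.event_id, k, v) for k, v in self.extras.items()]
        return triples


meeting = Event(
    event_id="ex:event1",
    event_type="ex:Meeting",
    actors=["ex:jans", "ex:alice"],
    place=(37.8716, -122.2727),
    start=datetime(2008, 11, 3, 10, 0),
    end=datetime(2008, 11, 3, 11, 0),
    extras={"ex:goodsExchanged": "ex:contract42"},
)
```

Each group of triples feeds a different kind of reasoning: the type for RDFS++, the actors for social network analysis, the place for geospatial queries and the times for temporal ones.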

Social Network Analysis Answers 4 questions

• How far is P1 from P2 (and how strong is the relation?)
• To what groups does this person belong (ego groups, cliques?)
• How important is this person in the group?
• Does this group have a leader, how cohesive are they?
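The first three questions can be sketched on a toy "knows" network. This is a simplified illustration: the data is invented, distance is plain breadth-first search, the ego group is everyone within a given number of steps, and importance is approximated by degree inside the group, which is only one of several centrality measures. Relation strength, cliques and group cohesion are not covered.

```python
from collections import deque

# Hypothetical symmetric "knows" network.
KNOWS = {
    "jans": {"alice", "bob"},
    "alice": {"jans", "bob", "carol"},
    "bob": {"jans", "alice"},
    "carol": {"alice", "dave"},
    "dave": {"carol"},
}


def distance(graph, start, goal):
    """Number of steps on a shortest path, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


def ego_group(graph, ego, depth):
    """Everyone within `depth` steps of ego, including ego."""
    group, frontier = {ego}, {ego}
    for _ in range(depth):
        frontier = {n for f in frontier for n in graph.get(f, ())} - group
        group |= frontier
    return group


def degree_centrality(graph, group):
    """For each member, the number of links to other members of the group."""
    return {m: len(graph.get(m, set()) & group) for m in group}


group = ego_group(KNOWS, "jans", 2)
scores = degree_centrality(KNOWS, group)
most_important = max(scores, key=scores.get)
```

Here `most_important` is the member of the two-step ego group with the most connections inside that group.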

GeoSpatial

• Make the following super efficient
– Where did something happen?
– How far was event1 from event2?
– Find all the events that occurred in a bounding box or radius of M miles?
– Do these two shapes overlap?
– Find all the objects in the intersection of two shapes
• On a very large scale
– when things don't fit in memory
– millions of events and polygons
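The distance, radius and bounding-box questions can be sketched with the haversine formula. This is a linear scan over a handful of invented events, so it shows what is being computed but not the "super efficient" part: at the scale the slide describes, a store would rely on a spatial index rather than checking every event. The event names and the Earth radius constant are assumptions of this sketch.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_bounding_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box (no wrap-around at the date line)."""
    return south <= lat <= north and west <= lon <= east


def events_within(events, lat, lon, miles):
    """Names of events at most `miles` from the given point (linear scan)."""
    return [
        name for name, (elat, elon) in events.items()
        if haversine_miles(lat, lon, elat, elon) <= miles
    ]


BERKELEY = (37.8716, -122.2727)
EVENTS = {  # hypothetical events with approximate city coordinates
    "oakland_meeting": (37.8044, -122.2712),
    "san_jose_meeting": (37.3382, -121.8863),
}
nearby = events_within(EVENTS, *BERKELEY, miles=5)
```

The Oakland event falls inside the 5 mile radius around Berkeley and the San Jose event does not.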

Temporal Reasoning

• Adhere to our convention to encode StartTimes and EndTimes and enjoy efficient temporal primitives
• Implementation of Allen's interval logic primitives
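Allen's interval logic names the thirteen ways two intervals can relate. The function below is a minimal sketch of that classification, assuming each interval is a `(start, end)` pair of comparable values with start before end; the function name and the relation labels are chosen for this illustration and are not the store's own primitive names.

```python
from datetime import date


def allen_relation(a, b):
    """Classify interval a against interval b as one of Allen's 13 relations."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining case: the intervals partly overlap without sharing an endpoint.
    return "overlaps" if a1 < b1 else "overlapped-by"


november = (date(2008, 11, 1), date(2008, 11, 30))
meeting = (date(2008, 11, 3), date(2008, 11, 4))
relation = allen_relation(meeting, november)
```

With start and end times stored under one convention, a "during" test like this is the kind of primitive a temporal query can build on.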

And try this on RDB/Cassandra: Activity Recognition

• Mix SNA with reasoning and temporal/geospatial reasoning.

Find all meetings that happened in November within 5 miles of Berkeley that were attended by the most important person in Jans' friends and friends of friends.

(select (?x)
  (ego-group person:jans knows ?group 2)                ;; SNA
  (actor-centrality-members ?group knows ?x ?num)       ;; SNA
  (q ?event fr:actor ?x)                                ;; DB Lookup
  (qs ?event rdf:type fr:Meeting)                       ;; RDFS
  (interval-during ?event "2008-11-01" "2008-11-06")    ;; Temporal
  (geo-box-around geoname:Berkeley ?event 5 miles)      ;; Spatial
  !)

Thanks..
