graph databases, triple stores and their uses…
San Jose, NoSQL, 2012
Jans Aasman
CEO, Franz Inc
Overview
• Franz Inc
• What is a Graph Database?
• An example to start
• What is a Triple Store?
• Where do people use Graph Databases and Triple Stores?
– Car manufacturing
– EPIM: a reporting platform for 31 oil companies
– Amdocs
• Why do people use Graph databases?
• How do you get a graph out of your relational database?
Franz Inc – Who We Are
• Private, founded 1984
• An AI and Semantic Technology company
• Berkeley/Oakland
Graph database
What is the difference between a relational database and a graph database?
How is it different and why is it more flexible?
• No Schema
– Say whatever you want to say but
• No Link Tables
– because you can do one-to-many relationships directly
• No Indexing Choices
– Can add new data attributes (predicates) on-the-fly that will be available for querying in real time, because everything is automatically indexed.
• Takes anything you give it: it is trivial to consume
– Rows and columns from RDB, XML, RDF(S), OWL, Text and Extracted Entities
A triple store is a special type of graph database
• Where nodes and links are (mostly) URIs
– subject, predicate, object, [graph]
– Persistent URIs make it straightforward to link datasets and create a web of data (LOD)
• Based on W3C recommendations
– RDFS puts an object layer on top of triples
– OWL adds first-order description logic
– SPARQL is very close to SQL but focuses on graphs
• Most graph databases live in memory; triple stores can be bigger than memory and rely more on indexing and query optimizers
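To make the subject, predicate, object idea concrete, here is a minimal in-memory sketch in Python. It is not how AllegroGraph or any real triple store is implemented; the class name `TinyTripleStore`, the `http://example.org/` namespace and the sample predicates are all assumptions made for illustration. It shows only that every position is indexed automatically and that a new predicate can be added without a schema change.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy in-memory triple store; None in a pattern acts as a wildcard."""

    def __init__(self):
        self.triples = set()
        # One index per position, so the user never has to choose what to index.
        self.by_s = defaultdict(set)
        self.by_p = defaultdict(set)
        self.by_o = defaultdict(set)

    def add(self, s, p, o):
        t = (s, p, o)
        self.triples.add(t)
        self.by_s[s].add(t)
        self.by_p[p].add(t)
        self.by_o[o].add(t)

    def match(self, s=None, p=None, o=None):
        # Start from the smallest index bucket among the bound positions, then filter.
        candidates = [
            index.get(value, set())
            for index, value in ((self.by_s, s), (self.by_p, p), (self.by_o, o))
            if value is not None
        ]
        pool = min(candidates, key=len) if candidates else self.triples
        return sorted(
            t for t in pool
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


EX = "http://example.org/"  # hypothetical namespace, not from the slides
store = TinyTripleStore()
store.add(EX + "jans", EX + "worksFor", EX + "franz")
store.add(EX + "jans", EX + "knows", EX + "alice")
# A new predicate can be introduced at any time; it is indexed like the others.
store.add(EX + "alice", EX + "livesIn", EX + "Berkeley")

print(store.match(s=EX + "jans"))
```

A real triple store would keep these indexes on disk and let a query optimizer choose among them, which is the point of the last bullet above.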
Demo - LOD
Facebook, Bing, Google are all building up big proprietary knowledge graphs
The public version: Linked Open Data
Tim Berners-Lee outlined four principles of linked data:
• Use URIs to identify things.
• Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
• Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
• Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
Oct 2007
LOD cloud – Sept 22 2010
latest LOD cloud
Demo politics
Who uses this in the enterprise?
DoD and Intelligence Customers
Enterprise Experience
• Amdocs: a Telco platform that knows (almost) everything about every customer in real time.
– Saves 20% on the total cost of a Customer Care Operation
• Car manufacturer X
– Warns early for disruptions in the supply chain
• EPIM: a reporting platform for 31 oil companies.
– Create a flexible unified reporting structure over tens of different proprietary reporting sources.
Risk in Supply Chain Management: determine potential impact of any disruptions to the supply chain.
Questions that an Early Warning System should answer:
– Which parts produced by a (sub-sub-)vendor will be less available due to a flood in China?
– Which of our cars will be affected by political unrest in Thailand?
– How can our competitors disrupt our supply chain by buying up all producers of this chip?
– Did one of our sub-sub-vendors start selling to our competition, and what does that mean for us?
– What happened historically with the price of this sub-part when the prices for crude oil or any other raw material went up?
– Is one of the (sub-sub-sub-)vendors in our chain in financial distress, and how would that affect us?
We need three graphs (or clouds of data) to come together
• The bills of materials for the cars that we produce, plus the parts tree, plus the names of the first tier vendors that provide parts and, if possible, our parts inventory and inventory prediction for parts.
• The supply chain for the first tier vendors
– Who sells the sub-parts to our first tier vendors, and then go recursively down this tree
– Get the names and geo locations and all other metadata about this network of vendors and suppliers
• Spider the web and business news sources for
– Every supplier, the countries where they are located, commodities, etc.
– Analyze the text for risks
The following slide shows the graph as used by power users
• Company X providing an Exhaust Muffler for a car Y
• That is bought from the first tier vendor USAcme
• Who buys it from a vendor in Bangkok (Thai Acme)
• Where we also show a newspaper article that has the news about floods in Thailand
• And because Thailand has the place Bangkok
• We have a potential risk.
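The chain of reasoning on this slide can be sketched as a walk over a small set of triples. The Python below is only an illustration under stated assumptions: the predicate names (`hasPart`, `suppliedBy`, `buysFrom`, `locatedIn`, `hasPlace`, `mentions`, `reports`) and the identifiers `CarY` and `Article1` are invented here, and a real system would run this as a graph query rather than as nested loops.

```python
# All names and predicates below are illustrative, not taken from a real dataset.
TRIPLES = {
    ("CarY", "hasPart", "ExhaustMuffler"),
    ("ExhaustMuffler", "suppliedBy", "USAcme"),
    ("USAcme", "buysFrom", "ThaiAcme"),
    ("ThaiAcme", "locatedIn", "Bangkok"),
    ("Thailand", "hasPlace", "Bangkok"),
    ("Article1", "reports", "Flood"),
    ("Article1", "mentions", "Thailand"),
}


def objects(s, p):
    """All o such that (s, p, o) is in the graph."""
    return {o for (s2, p2, o) in TRIPLES if s2 == s and p2 == p}


def subjects(p, o):
    """All s such that (s, p, o) is in the graph."""
    return {s for (s, p2, o2) in TRIPLES if p2 == p and o2 == o}


def vendors_behind(part):
    """First tier vendors of a part plus everyone they buy from, recursively."""
    seen, todo = set(), list(objects(part, "suppliedBy"))
    while todo:
        vendor = todo.pop()
        if vendor not in seen:
            seen.add(vendor)
            todo.extend(objects(vendor, "buysFrom"))
    return seen


def risks_for(car):
    """Link parts -> vendors -> places -> countries -> news events."""
    found = set()
    for part in objects(car, "hasPart"):
        for vendor in vendors_behind(part):
            for place in objects(vendor, "locatedIn"):
                for country in subjects("hasPlace", place):
                    for article in subjects("mentions", country):
                        for event in objects(article, "reports"):
                            found.add((part, vendor, place, country, event))
    return found


print(risks_for("CarY"))
```

The three data sources from the previous slide show up as three groups of triples: the bill of materials, the vendor network, and the facts extracted from news.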
EPIM ReportingHub Solution
May 11th, 2011
Data Processing Approach
[Diagram: a three-stage pipeline, Import → Processing → Export. Labels recovered from the slide: XML, Excel tables and RDB (via D2RQ) sources, each with a Semantic Model; Mapping Rules & Models (SPIN); RDF Repository; Domain Ontologies (SPIN); Output Templates & Models (SWP); XML, Excel, HTML and JSON outputs.]
Capability Architecture
[Diagram: layered capability architecture. Capabilities: Reporting, Normalization, Classification, Validation, Report, Dissemination. Application Ontologies: Reporting Obligations, Normalization Mappings, Classification Models, Validation Rulesets, Report Templates, Dissemination Obligations. System Ontologies: Access Control, Transforms, Metrics, Workflow, Logging, Notification. Resource Ontologies: DDR, ISO 15926, NPD FactPages, PCA RDL, Operators, Partners, Policies.]
When graph database or triple store?
• You have billions of 'same-type' objects and you need to retrieve them extremely fast. Or you need simple analytics.
• You have a fixed size, static data set and you need fast graph computations and pattern matching.
• You need all the features of an enterprise database, but you also need to work with an ontology-driven knowledge base and rules, and you want the flexibility of a graph database.
When Graph Database or Triple Store?
When you need ultimate flexibility
• Modeling knowledge and assets
• Hundreds to thousands of classes with different features
• Every day new classes and new features
• You work with rules and reasoning
When you need ultimate 'linkability'
• For (ad hoc) integration of databases
When you need pattern recognition and network analysis
• Complex networks of people, companies, products, etc.
When you need event processing using geospatial and temporal reasoning and social network analysis, combined with flexible metadata
Q1: A reasonably hard query for horizontally scaling stores and RDB, a straightforward query for vertical/parallel store
Select ?a ?b ?c ?d ?e
where {
  Franz send-money ?a
  ?a send-money ?b
  ?b send-money ?c
  ?c send-money Cray
  Cray send-money ?d
  Not (?d = ?c)
  ?d send-money ?e
  Not (?e = ?b)
  ?e send-money Franz}
Q1: A very hard query for horizontally scaling stores and RDB, a straightforward query for vertical/parallel store
Find a money trail from Franz to Cray that is more than two steps, find another money trail from Franz to Cray that is more than two steps, where the two trails are completely different
(Select (?path1 ?path2)
  (path Franz Cray <send-money> >= 2 ?path1)
  (path Cray Franz <send-money> >= 2 ?path2)
  (empty (intersection ?path1 ?path2)))
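As a rough illustration of what the path query asks for, the Python sketch below enumerates cycle-free money trails over a made-up edge list and keeps pairs that do not share intermediate nodes. It follows the query as written (a trail from Franz to Cray and one from Cray back to Franz, each of at least two steps). The sample edges, the `max_steps` cap and the choice to compare only intermediate nodes are assumptions made here, not part of the slides.

```python
EDGES = [  # (sender, receiver) pairs; made-up sample data
    ("Franz", "A"), ("A", "B"), ("B", "Cray"),
    ("Cray", "D"), ("D", "E"), ("E", "Franz"),
    ("Cray", "B"), ("B", "Franz"),
]


def successors(node):
    return [receiver for (sender, receiver) in EDGES if sender == node]


def simple_paths(start, goal, min_steps, max_steps=6):
    """Yield cycle-free paths start -> goal that use min_steps..max_steps edges."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        for nxt in successors(path[-1]):
            steps = len(path)  # number of edges once nxt is appended
            if nxt == goal:
                if steps >= min_steps:
                    yield path + [nxt]
            elif nxt not in path and steps < max_steps:
                stack.append(path + [nxt])


def disjoint_trails(a, b, min_steps=2):
    """Pairs (trail a -> b, trail b -> a) with no intermediate node in common."""
    pairs = []
    for p1 in simple_paths(a, b, min_steps):
        for p2 in simple_paths(b, a, min_steps):
            # Assumption: the endpoints are shared by definition, so only
            # the intermediate nodes are required to differ.
            if not set(p1[1:-1]) & set(p2[1:-1]):
                pairs.append((p1, p2))
    return pairs


print(disjoint_trails("Franz", "Cray"))
```

The trail through B on the way back is rejected because B already appears on the outbound trail.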
Why is this hard in SQL
• Relational databases are very good at straight joins but less optimal for self-joins of unpredictable length
• Try writing this as a SQL query ☺
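To see what the self-join point means in practice, here is a small sketch using Python's standard `sqlite3` module. The table name `send_money`, its columns and the sample rows are assumptions made for this example. A trail of known length needs one self-join per hop; a trail of unknown length needs a recursive common table expression, capped here at six hops as an arbitrary limit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE send_money (sender TEXT, receiver TEXT)")
conn.executemany(
    "INSERT INTO send_money VALUES (?, ?)",
    [("Franz", "A"), ("A", "B"), ("B", "Cray"), ("Cray", "D")],
)

# Fixed length: one self-join per hop (Franz -> ?a -> ?b -> Cray).
three_hops = conn.execute("""
    SELECT t1.receiver, t2.receiver
    FROM send_money t1
    JOIN send_money t2 ON t2.sender = t1.receiver
    JOIN send_money t3 ON t3.sender = t2.receiver
    WHERE t1.sender = 'Franz' AND t3.receiver = 'Cray'
""").fetchall()

# Unknown length: a recursive common table expression, capped at 6 hops.
reachable = conn.execute("""
    WITH RECURSIVE trail(node, hops) AS (
        SELECT receiver, 1 FROM send_money WHERE sender = 'Franz'
        UNION
        SELECT s.receiver, trail.hops + 1
        FROM send_money s JOIN trail ON s.sender = trail.node
        WHERE trail.hops < 6
    )
    SELECT node, MIN(hops) FROM trail GROUP BY node ORDER BY node
""").fetchall()

print(three_hops)
print(reachable)
```

Adding the "two trails must be completely different" condition on top of this means carrying the visited nodes along in the recursion, which is where the SQL becomes awkward.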
Why is this hard in distributed key/value stores
• Databases like Cassandra are extremely good at retrieving nested objects in a sea of billions of objects but are less optimal for joins.
• Relatively hard to write these as map-reduce expressions. Every query has to be expressed as a program; ad hoc querying is therefore discouraged.
A Simple Event Ontology
• A type
– Meetings, communications event, financial transactions, visit, attack/truce, an insurance claim, a purchase order
– RDFS++ reasoning
• A list of actors
– Social Network Analysis
• A place
– GeoSpatial Reasoning
• A start-time and possibly an end-time
– Temporal Reasoning
• Anything else that describes the event
– Goods that changed hands
Social Network Analysis Answers 4 questions
• How far is P1 from P2 (and how strong is the relation)?
• To what groups does this person belong (ego groups, cliques)?
• How important is this person in the group?
• Does this group have a leader, and how cohesive are they?
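Three of these questions can be illustrated with a few lines of Python over a toy "knows" network. This is a sketch under assumptions: the people and ties are invented, distance is plain hop count from a breadth-first search, and importance is measured by degree centrality inside the group, which is only one of several centrality measures a real SNA library would offer.

```python
from collections import deque

KNOWS = {  # undirected "knows" ties; made-up data
    "jans": {"ann", "bob"},
    "ann": {"jans", "bob", "cid"},
    "bob": {"jans", "ann"},
    "cid": {"ann", "dee"},
    "dee": {"cid"},
}


def distances(start):
    """Breadth-first search: hop count from start to every reachable person."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        person = queue.popleft()
        for friend in KNOWS[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def ego_group(person, depth):
    """Everyone within `depth` hops of person, including the person."""
    return {p for p, d in distances(person).items() if d <= depth}


def degree_centrality(group):
    """Number of ties each member has to other members of the group."""
    return {p: len(KNOWS[p] & group) for p in group}


group = ego_group("jans", 2)
scores = degree_centrality(group)
print(distances("jans")["dee"], sorted(group), max(scores, key=scores.get))
```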
GeoSpatial
• Make the following super efficient
– Where did something happen?
– How far was event1 from event2?
– Find all the events that occurred in a bounding box or radius of M miles?
– Do these two shapes overlap?
– Find all the objects in the intersection of two shapes
• On a very large scale
– when things don't fit in memory
– millions of events and polygons
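The distance and radius questions can be sketched in Python as a cheap bounding-box filter followed by an exact great-circle check. This is not the indexing scheme of any particular product; the event names and coordinates are sample values chosen here, the Earth radius is the usual mean value in miles, and the box computation is an approximation that is reasonable for small radii away from the poles and the antimeridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def bounding_box(lat, lon, miles):
    """Approximate lat/lon box around a point; fine for small radii."""
    dlat = math.degrees(miles / EARTH_RADIUS_MILES)
    dlon = math.degrees(miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def events_within(events, lat, lon, miles):
    """Cheap box test first, exact distance test second."""
    lo_lat, hi_lat, lo_lon, hi_lon = bounding_box(lat, lon, miles)
    hits = []
    for name, (elat, elon) in events.items():
        in_box = lo_lat <= elat <= hi_lat and lo_lon <= elon <= hi_lon
        if in_box and haversine_miles(lat, lon, elat, elon) <= miles:
            hits.append(name)
    return sorted(hits)


EVENTS = {  # sample coordinates near Berkeley, Oakland and San Jose
    "meeting1": (37.8716, -122.2727),
    "meeting2": (37.8044, -122.2712),
    "meeting3": (37.3382, -121.8863),
}
print(events_within(EVENTS, 37.8716, -122.2727, 5))
```

On a large scale the box test is what a spatial index answers without scanning every event, so only a few candidates reach the exact distance computation.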
Temporal Reasoning
• Adhere to our convention to encode StartTimes and EndTimes and enjoy efficient temporal primitives
• Implementation of Allen's interval logic primitives
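Allen's interval logic distinguishes thirteen possible relations between two intervals. The Python sketch below classifies a pair of intervals into one of them. It is a plain illustration, not the store's implementation: the function name `allen_relation` and the representation of an interval as a `(start, end)` pair with `start < end` are assumptions made here. ISO date strings work as endpoints because they compare in chronological order.

```python
def allen_relation(a, b):
    """Allen relation of interval a to interval b; each is (start, end), start < end."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlap with no shared endpoints is left.
    return "overlaps" if a_start < b_start else "overlapped-by"


print(allen_relation(("2008-11-03", "2008-11-04"), ("2008-11-01", "2008-11-06")))
```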
And try this on RDB/Cassandra: Activity Recognition
• Mix SNA with reasoning and temporal/geospatial reasoning.
Find all meetings that happened in November within 5 miles of Berkeley that were attended by the most important person in Jans' friends and friends of friends.
(select (?x)
  (ego-group person:jans knows ?group 2)                      ;; SNA
  (actor-centrality-members ?group knows ?x ?num)             ;; SNA
  (q ?event fr:actor ?x)                                      ;; DB Lookup
  (qs ?event rdf:type fr:Meeting)                             ;; RDFS
  (interval-during ?event "2008-11-01" "2008-11-06")          ;; Temporal
  (geo-box-around geoname:Berkeley ?event 5 miles)            ;; Spatial
  !)
Thanks..