shape expression schemas for...

101
Shape Expression Schemas for RDF Semantics, Complexity, and Inference Slawek Staworko University of Lille & INRIA LINKS GT-ALGA 2019 Paris, France 10 October 2019 Slawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 1 / 57

Upload: others

Post on 21-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Shape Expression Schemas for RDFSemantics, Complexity, and Inference

S lawek StaworkoUniversity of Lille & INRIA LINKS

GT-ALGA 2019

Paris, France

10 October 2019

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 1 / 57

Page 2: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

What is RDF and does it need schemas?

bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related

reportedBy

descr

reportedBy

descr

reportedBy

descr

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 2 / 57

Page 3: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

What is RDF and does it need schemas?

bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related

reportedBy

descr

reportedBy

descr

reportedBy

descr

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 2 / 57

Page 4: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

RDF at its conception and RDF now

What is RDF?

I set of triples 〈subject, predicate, object〉

Originally, free-range RDF

I The driving technology of Web 3.0 [Tim Berners-Lee]

I “Just publish your data so others can access it!”

I Intentionally schema-free and ontology oriented

Nowadays, industrial-strength RDF

I Produced and consumed by applications (data exchange format)

I Often obtained from exporting data from relational databases (e.g., R2RML)

I Follows a strict structure

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 3 / 57

Page 5: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Outline

Uses for schemas

I Provide a semantic insight into data

I Capture the structure of the graph (summary)

I Enable validation i.e., checking data conformance

If RDF that does not come with schema?

I Infer the schema

If the RDF does not satisfy its schema?

I Repair the RDF

Initial research [ICDT’15]

I Define the semantics of ShEx

I Complexity of validation

I Expressive power

Joint work with joint work with I. Boneva, J. Labra Gayo,S. Hym, E. G. Prud’hommeaux, and H. Solbrig

Current research

I Containment problem [PODS’19]

I Grammatical inference (ongoing)

Joint work with P. Wieczorek, A. Lemay, and B. Groz

Future research

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 4 / 57

Page 6: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Previously existing schema formalisms for RDF

RDF Schema (RDFS) [W3C]:

I lightweight ontology language (types and type inclusion relations)

I range and domain constraints for properties (predicate types)

I virtually no power to constrain the structure of the graph

bug1

bug2

user1

user2

related

reportedBy

reportedBy

rdfs:Classrdf:Property

BugReport User

reportedByrelated

rdf:typerdf:type rdf:type

rdf:type

rdfs:

subCl

rdfs:subCl

rdfs:subPrrdfs:su

bPr

rdfs:domain rdfs:range

rdfs:domain

rdfs:range

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 5 / 57

Page 7: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Previously existing schema formalisms for RDF (cont’d.)

OWL + CWA + UNA [Sirin, RR’10]

I Potentially confusing nonstandard semantics

I Potentially high complexity of validation

SPARQL (SPIN) [Bolleman et al., SWAT4LS’12]

I Very powerful and expressive

I High complexity

Resource Shapes [IBM, Ryman et al., LDOW’13]

I Extends RDFS with simple cardinality constraints on the outbound neighborhood of a node

What does exactly RDF validation mean

Verification the typing is given (with rdf:type) and its correctness is to be verified(this meaning is employed by the above schema proposals)

Model checking no typing is given and the goal is to construct a valid typing(this is a more general problem that we are interested in)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 6 / 57

Page 8: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Shape Expressions Schemas

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 7 / 57

Page 9: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Shape Expression Schema (ShEx)

Syntax

ShEx is a set of rules of the form Type → RegExp(Predicate × Type)

bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

Bug → descr :: str,

reportedBy :: User,

reproducedBy :: Employee?,related :: Bug*

User → name :: str,

email :: str?

Employee → name :: str,

email :: str

SemanticsGraph satisfies a schema if every node has at least one type

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 8 / 57

Page 10: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Shape Expression Schema (ShEx)

Syntax

ShEx is a set of rules of the form Type → RegExp(Predicate × Type)

bug1 bug2

bug3 bug4

user1

user2user2user2

emp1emp1emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

Bug → descr :: str,

reportedBy :: User,

reproducedBy :: Employee?,related :: Bug*

User → name :: str,

email :: str?

Employee → name :: str,

email :: str

SemanticsGraph satisfies a schema if every node has at least one type

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 8 / 57

Page 11: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Background information

Shape Expressions Schemas (ShEx)

I Inspired by XML Schema and reminiscent of (tree) automata

I Based on regular expressions under commutative closuremembership NP-c [Kopczynski&To’10]; containment coNEXP-c [Haase&Hofman’16]

I Envisioned as a potential XSLT-like transformation engine for RDF

ShEx vs SHACL

I ShEx is a schema language with a growing base of users and a host of applications (e.g., Wikidata)

I SHACL is Shape Constraint Language (e.g., path constraints)

I significant overlap (upcoming paper) but also differences (recursion, negation etc.)

I comparable validation complexity (NP-complete)

I both have been developed under the tutelage of W3C

I SHACL ended up a W3C Recommendation (yay!), ShEx a W3C Community Group Project

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 9 / 57

Page 12: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Flavorful ShEx

<Emp> {

name xsd:string | (first-name xsd:string, last-name xsd:string),

email xsd:string,

department @<Dept>

}

<Dept> {

name xsd:string,

reportedBy @<User>,

reportedOn xsd:dateTime,

(manager @<Emp>, appointedOn xsd:dateTime)?,

employees @<Emp>*

}

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 10 / 57

Page 13: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bags of symbols (unordered words)

Bag (multiset) is a function mapping a symbol to the number of its occurrences.

w0 = {|a, a, a, c, c|} represents the function w0(a) = 3, w0(b) = 0, and w0(c) = 2.

The collection of outgoing labels is a bag:

out-labG (n) = {|a | (n, a,m) ∈ EG |}

G0:

n0

n1

n2

n3

n4

a

b

b

a

b

c

a

out-labG0 (n0) = {|a, a, b|}

Bag union: {|a, c, c|} ] {|a, b|} = {|a, a, b, c, c|} (concatenation of unordered words).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 11 / 57

Page 14: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Regular Bag Expressions (RBEs)

Language of regular expressions for defining bags (unordered concatenation , )

E ::= ε | a | E∗ | (E“|”E) | (E“, ”E)

with natural macros: E? := (ε | E) and E+ := (E , E∗)

Examples

I (a∗, b+, c, c) – arbitrary number of a’s, positive number of b’s, and two c’s

I (a, b)∗ – the same number of a’s and b’s

I (a, b, c)∗ – the same number of a’s, b’s, and c’s.

RBEs are equivalent to

1. Presburger arythmetic (PA),

2. Parikh images of context-free languages,

3. semilinear sets.

Computational properties

I Membership w ∈ E is NP-complete,

I Emptiness E1 ∩ E2 = ∅ is coNP-complete.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 12 / 57

Page 15: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

RBE0 and ShEx0 = ShEx(RBE0)

RBE0

I RBEs using only symbols with multiplicities {0, 1, ∗,+, ?} and , operator only

I can be canonized a, a? ≡ a[1,2], b+, b+ ≡ b[2,∞], etc.

I the canonical form is a[n,n′], b[m,m′], . . .

I Presburger formulas: conjunctions of atoms #a < n and #a > n

I captures IBM’s Resource Shapes

Computational properties: simple arithmetic

A lightweight class enjoying tractability of a number of problems:

I membership

I containment

I intersection (also with RBE1)

Also learnable from positive examples [DBPL’13]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 13 / 57

Page 16: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Getting the semantics of ShEx right

G0:

n0

n1

n2

n3

n4

t0

t1

t2

t1

t2

a

b

b

a

b

c

aS0: t0 → (a :: t1)+, b :: t2

t1 → (a :: t1 | b :: t2)∗

t2 → b :: t2 | c :: t1

out-lab-typeλ0G0

(n0) = {|a :: t1, a :: t1, b :: t2|}

A single-type typing is a function λ : V → Γ.

λ is valid if every node n satisfies its type definition i.e.,

out-lab-typeλG (n) = {|a :: λ(m) | (n, a,m) ∈ E |} ∈ δ(λ(n)).

A valid single-type typing of G0 w.r.t. S0

λ0(n0) = t0, λ0(n1) = t1, λ0(n2) = t2, λ0(n3) = t1, λ0(n4) = t2.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 14 / 57

Page 17: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Getting the semantics of ShEx right

G0:

n0

n1

n2

n3

n4

t0

t1

t2

t1

t2

a

b

b

a

b

c

aS0: t0 → (a :: t1)+, b :: t2

t1 → (a :: t1 | b :: t2)∗

t2 → b :: t2 | c :: t1

out-lab-typeλ0G0

(n0) = {|a :: t1, a :: t1, b :: t2|}

A single-type typing is a function λ : V → Γ.

λ is valid if every node n satisfies its type definition i.e.,

out-lab-typeλG (n) = {|a :: λ(m) | (n, a,m) ∈ E |} ∈ δ(λ(n)).

A valid single-type typing of G0 w.r.t. S0

λ0(n0) = t0, λ0(n1) = t1, λ0(n2) = t2, λ0(n3) = t1, λ0(n4) = t2.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 14 / 57

Page 18: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Getting the semantics of ShEx right

G0:

n0

n1

n2

n3

n4

t0

t1

t2

t1

t2

a

b

b

a

b

c

aS0: t0 → (a :: t1)+, b :: t2

t1 → (a :: t1 | b :: t2)∗

t2 → b :: t2 | c :: t1

out-lab-typeλ0G0

(n0) = {|a :: t1, a :: t1, b :: t2|}

A single-type typing is a function λ : V → Γ.

λ is valid if every node n satisfies its type definition i.e.,

out-lab-typeλG (n) = {|a :: λ(m) | (n, a,m) ∈ E |} ∈ δ(λ(n)).

A valid single-type typing of G0 w.r.t. S0

λ0(n0) = t0, λ0(n1) = t1, λ0(n2) = t2, λ0(n3) = t1, λ0(n4) = t2.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 14 / 57

Page 19: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Getting the semantics of ShEx right

G0:

n0

n1

n2

n3

n4

t0

t1

t2

t1

t2

a

b

b

a

b

c

aS0: t0 → (a :: t1)+, b :: t2

t1 → (a :: t1 | b :: t2)∗

t2 → b :: t2 | c :: t1

out-lab-typeλ0G0

(n0) = {|a :: t1, a :: t1, b :: t2|}

A single-type typing is a function λ : V → Γ.

λ is valid if every node n satisfies its type definition i.e.,

out-lab-typeλG (n) = {|a :: λ(m) | (n, a,m) ∈ E |} ∈ δ(λ(n)).

A valid single-type typing of G0 w.r.t. S0

λ0(n0) = t0, λ0(n1) = t1, λ0(n2) = t2, λ0(n3) = t1, λ0(n4) = t2.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 14 / 57

Page 20: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Intractability of single-type validation

Validation problem

Checking if there exists a valid typing of given input graph w.r.t. a given input schema.

Sources of complexity

1. guessing a typing

2. checking that it is valid (RBE membership is already NP-complete)

TheoremSingle-type validation is NP-complete (even if only RBE0 are used).

Reduction from graph 3-colorability:

tr → :: t∗b , :: t∗g tg → :: t∗r , :: t∗b tb → :: t∗g , :: t∗r

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 15 / 57

Page 21: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Intractability of single-type validation

Validation problem

Checking if there exists a valid typing of given input graph w.r.t. a given input schema.

Sources of complexity

1. guessing a typing

2. checking that it is valid (RBE membership is already NP-complete)

TheoremSingle-type validation is NP-complete (even if only RBE0 are used).

Reduction from graph 3-colorability:

tr → :: t∗b , :: t∗g tg → :: t∗r , :: t∗b tb → :: t∗g , :: t∗r

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 15 / 57

Page 22: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1

t2

t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 23: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0

t1

t2

t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 24: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1

t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 25: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 26: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 27: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 28: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

When defining that a node n satisfies a type t. . .

I we inspect the outbound neighborhood out-lab-nodeG (n) = {(a,m) | (n, a,m) ∈ E}

I a node m may assume any of its assigned types λ(m), one per edge incoming from n

I (n, a,m) ∈ E yields the choice |t∈λ(m) a :: t

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 29: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

When defining that a node n satisfies a type t. . .

I we inspect the outbound neighborhood out-lab-nodeG (n) = {(a,m) | (n, a,m) ∈ E}I a node m may assume any of its assigned types λ(m), one per edge incoming from n

I (n, a,m) ∈ E yields the choice |t∈λ(m) a :: t

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 30: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

When defining that a node n satisfies a type t. . .

I we inspect the outbound neighborhood out-lab-nodeG (n) = {(a,m) | (n, a,m) ∈ E}I a node m may assume any of its assigned types λ(m), one per edge incoming from n

I (n, a,m) ∈ E yields the choice |t∈λ(m) a :: t

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 31: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

When defining that a node n satisfies a type t. . .

I we inspect the outbound neighborhood out-lab-nodeG (n) = {(a,m) | (n, a,m) ∈ E}I a node m may assume any of its assigned types λ(m), one per edge incoming from n

I (n, a,m) ∈ E yields the choice |t∈λ(m) a :: t

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 32: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Multi-type semantics of shape expressions

A multi-type typing is a function λ : V → 2Γ that assign to every node a set of types.

G1: n0 n1 n2a c

b

S1: t0 → a :: t1

t1 → b :: t2, c :: t3

t2 → (b :: t2)?, c :: t3

t3 → ε

G2: m0 m1

a

bt0 t1 t2

S2: t0 → a :: t1, b :: t2

t1 → (c :: t1)∗

t2 → (d :: t2)?

t0 t1 t2 t3

λ1(n0) = {t0}, λ1(n1) = {t1, t2}, λ1(n2) = {t3}.

OutType(n1, λ1) = (b :: t1 | b :: t2), c :: t3

λ is valid if every node satisfies every of its associated types.

n satisfies t w.r.t. λ if OutType(n, λ) ∩ δ(t) 6= ∅,where OutType(n, λ) = ,

(n,a,m)∈E (|t∈λ(m) a :: t)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 16 / 57

Page 33: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Refinement algorithm

The set of all valid multi-type typings of G w.r.t. S is a semi-lattice.

1. Start with the universal typing λ(n) := Γ

2. Iteratively refine it λ := Refine(λ)

[Refine(λ)](n) = {t ∈ λ(n) | OutType(n, λ) ∩ δ(t) 6= ∅}.

3. Until a fix-point is reached

4. The graph satisfies the schema iff the fix-point λ is valid . . .. . . and then λ is also the maximal valid multi-type typing.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 17 / 57

Page 34: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Satisfiability of RBEs

OutType yields expressions of the form (RBE1)

(a1 | · · · | ak), · · · , (z1 | . . . | zm)

The essential decision problem for a class C of RBEs used in the ShEx schema is

INTER1(C) = {(E0,E) ∈ RBE1 × C | E0 ∩ E 6= ∅}.

LemmaTractability of INTER1 is a necessary and sufficient condition for tractability of multi-type validation.

Corollary

Multi-type validation is NP-complete.

TheoremMulti-type validation for ShEx0 = ShEx(RBE0) is in PTIME

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 18 / 57

Page 35: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Determinism

Determinism of shape expressions

Given the type (of a node) and the label of an outgoing edge, the expression specifies the type that the targetnode must satisfy.

a :: t1, b :: t∗2 , a :: t1, c :: t2 (a :: t1, b :: t2) | (a :: t3, c :: t4) a :: t1, b :: t∗2 , a :: t3

deterministic not deterministic not deterministic

LemmaFor schemas using only deterministic shape expressions, tractability of membership is a sufficient and necessarycondition for tractability of multi-type validation

Proof sketch

I Knowing the label a of an outgoing edge determines the type ta for the target node

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t) becomes ,

(n,a,m)∈E (a :: ta)

I ,(n,a,m)∈E (a :: ta) defines a singleton {w} with w = {|a :: ta | (n, a,m)|}

I OutType(n, λ) ∩ δ(t) 6= ∅ ≡ w ∈ δ(t).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 19 / 57

Page 36: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Determinism

Determinism of shape expressions

Given the type (of a node) and the label of an outgoing edge, the expression specifies the type that the targetnode must satisfy.

a :: t1, b :: t∗2 , a :: t1, c :: t2 (a :: t1, b :: t2) | (a :: t3, c :: t4) a :: t1, b :: t∗2 , a :: t3

deterministic not deterministic not deterministic

LemmaFor schemas using only deterministic shape expressions, tractability of membership is a sufficient and necessarycondition for tractability of multi-type validation

Proof sketch

I Knowing the label a of an outgoing edge determines the type ta for the target node

I OutType(n, λ) = ,(n,a,m)∈E (|t∈λ(m) a :: t) becomes ,

(n,a,m)∈E (a :: ta)

I ,(n,a,m)∈E (a :: ta) defines a singleton {w} with w = {|a :: ta | (n, a,m)|}

I OutType(n, λ) ∩ δ(t) 6= ∅ ≡ w ∈ δ(t).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 19 / 57

Page 37: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Single-occurrence REBs (SORBEs)

SORBE allows a symbol to be used at most once in an expression but also allows a[n,m]

TheoremMembership for SORBE is in PTIME :)

a :: t1, b :: t∗2 , a :: t1 (a :: t1, b :: t2) | (a :: t3, c :: t4) (a :: t1, b :: t2)∗, c :: t3

deterministic not deterministic deterministicbut yet and

not single-occurrence single-occurence single-occurrence

TheoremMulti-type validation for deterministic shape expressions using SORBE is in PTIME. :)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 20 / 57

Page 38: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Expressive power of ShEx

G<: a

b

c

c

G�: a

b

c

c

FOG ShExm ShExs ∃MSOG

L : G� ∈ L, G< ∈ L 3 3 3 3

L : G� 6∈ L, G< ∈ L 3 5 3 3

L : G� ∈ L, G< 6∈ L 3 5 5 3

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 21 / 57

Page 39: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Quick recap

Complexity

RBE0 RBE SORBESORBE

det.multi-type PTIME NP-complete PTIMEsingle-type NP-complete

Expressive power

I automata-like formalism

I incomparable with FO and MSO (unless we forbid ∗ over expressions)

I captured with MSO+PA

I incomparable with NR and HR graph grammars

I closed under intersection but not under union or negation

I single-type semantics is more expressive than multi-type semantics

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 22 / 57

Page 40: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Containment of ShEx

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 23 / 57

Page 41: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Containment problem

Containment S1 ⊆ S2

Does every graph that satisfies S1 also satisfies S2?

Motivation

I Fundamental problem (static analysis: query optimization, schema minimization etc.)

I Inference of ShEx (work in progress)

G1

Positive example

G2

Negative example

S1 S2 S3Generalization

⊆Over-generalization

The challenge

I RBEs = Presburger Arithmetic (PA)

I MSOG 6⊇ ShEx ⊆ MSOG + PA

I MSOG with very little arithmetic becomes undecidable [Elgot&Rabin’66]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 24 / 57

Page 42: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 43: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 44: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 45: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 46: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 47: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Decidability of Containment

S1 : t0 → a :: t*1

t1 → b :: t?1

6⊆S2 : s0 → a :: s1 | (a :: s1, a :: s2)*

s1 → b :: s?2 s2 → ε

G : ({t0}, {})

({t1}, {s1})({t1}, {s1})

({t1}, {s1, s2})

aa

a

b b

G : ({t0}, {s0})

({t1}, {s1})

({t1}, {s1, s2})

aa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

aaa

b

G : ({t0}, {})

({t1}, {s1})

({t1}, {s1, s2})

a1a 2

b1

Can we boundthese values?

Containment of ShEx is in co2NEXPNP and coNEXP-hard

I The counter-example is a graph with at most exponential number of nodes, one node per (A,B)-kind

I There is a PA formula ϕ that describes the multiplicities

I PA enjoys an upper bound O(|ϕ|3|x̄|k

) on minimal solutions [Weispfenning’90]

I Double exponential upper bound on the (binary) size of the values of multiplicities

I Validation of graphs with multiplicities remains in NP

I Containment of commutative REs recently shown to be coNEXP-hard [Haase&Hofman’16]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 25 / 57

Page 48: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

ShEx0

I no disjunction (a :: t1 | b :: t2) and no grouping (a :: t1, b :: t2)*

I Shape Graphs – an equivalent graphical representation

bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

Bug→ descr :: str, reportedBy :: User, reproducedBy :: Employee?, related :: Bug*

User→ name :: str, email :: str?

Employee→ name :: str, email :: str

Bug

User Employee

str

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1email?

name

1email

1

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 26 / 57

Page 49: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Embeddings

I Graph morphism with occurrence constraints, closely related to graph simulations

I Capture semantics of ShEx0 by means of structural comparison

I Generalize naturally to pairs of shape graphs

bug1bug3

user1 emp1

“Boom!”“Kabang!” “John” “Mary” “[email protected]

name

name

email

related

reportedBy

reproducedBy

descr

reportedBy

descr

Bug

User Employee

string

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1email?

name

1email

1

Bug

Person

string

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1

email

?

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 27 / 57

Page 50: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Embeddings

I Graph morphism with occurrence constraints, closely related to graph simulations

I Capture semantics of ShEx0 by means of structural comparison

I Generalize naturally to pairs of shape graphs

bug1bug3

user1 emp1

“Boom!”“Kabang!” “John” “Mary” “[email protected]

name

name

email

related

reportedBy

reproducedBy

descr

reportedBy

descr

Bug

User Employee

string

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1email?

name

1email

1

Bug

Person

string

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1

email

?

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 27 / 57

Page 51: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Properties of embeddings

Embedding and containment

I Embedding implies containment

I In general, the converse does not hold

H :

a *

b *

K :

a* a *

b

a*

b

b*

H cannot be embedded into K (b :: t* is equivalent to ε | b :: t | b :: t+)

TheoremConstructing embeddings is

I in PTIME if only 1, ?, *, + are used

I NP-complete if arbitrary occurrence constraints are allowed a :: t [n;m]

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 28 / 57

Page 52: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

When does containment implies embedding?

Determinism

I DetShEx0 every type uses each predicate symbol at most once

I DetShEx−0 no + are allowed and ? must be dominated by *

Characterizing graph

For any H ∈ DetShEx−0 there is a polynomially-sized graph G characterizing H under containment i.e.,

∀K ∈ DetShEx-0. G satisfies K ⇒ H ⊆ K .

TheoremContainment for DetShEx−0 is in PTIME

H:*

*

?

*

*

?

G :

TheoremContainment for DetShEx0 is coNP-hard

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 29 / 57

Page 53: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Two equivalent ShEx0 schemas and their shape graphs

H :

Bug→ descr :: str, reportedBy :: User, reproducedBy :: Employee?,

related :: Bug*

User→ name :: str, email :: str?

Employee→ name :: str, email :: str

K :

User1 → name :: str

User2 → name :: str, email :: str

Bug1 → descr :: str, reportedBy :: User1, reproducedBy :: Employee?,

related :: Bug*1, related :: Bug*

2

Bug2 → descr :: str, reportedBy :: User2, reproducedBy :: Employee?,

related :: Bug*1, related :: Bug*

2

Employee→ name :: str, email :: str

H:

B

U E

L

r*

u

e?

d

n

m? m

n

K :

B1B2

U1U2

E

L

r*

r*

u

e?

d

r*

r*

u

e?

d

mn

n

n

m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 30 / 57

Page 54: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Coverings

Generalization of embeddings

A type t is covered by a set of types S = {s1, . . . , sk} iff any node satisfying t also satisfies one of the types in S

H:

B

U E

L

r*

u

e?

d

n

m? m

n

K :

B1B2

U1U2

E

L

r*

r*

u

e?

d

r*

r*

u

e?

d

mn

n

n

m

Lemma (Constructing covering)

Covering is the maximum relation R ⊆ Types(H)× P(Types(K)) such that

∀(t,S) ∈ R. def(t)Unfold−−−→R {def(s) | s ∈ S}.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 31 / 57

Page 55: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Unfolding

H:

B

U E

L

r*

u

e?

d

n

m? m

n

K :

B1B2

U1U2

E

L

r*

r*

u

e?

d

r*

r*

u

e?

d

mn

n

n

m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 32 / 57

Page 56: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Unfolding

H:

B

U E

L

r*

u

e?

d

n

m? m

n

K :

B1B2

U1U2

E

L

r*

r*

u

e?

d

r*

r*

u

e?

d

mn

n

n

m

Unfolding U into {U1,U2}

U → n :: L, m :: L? ≡ n :: L, (ε | m :: L) ≡ (n :: L) | (n :: L, m :: L)← U1 | U2

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 32 / 57

Page 57: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Unfolding

H:

B

U E

L

r*

u

e?

d

n

m? m

n

K :

B1B2

U1U2

E

L

r*

r*

u

e?

d

r*

r*

u

e?

d

mn

n

n

m

Unfolding B into {B1,B2}

B → r :: B*, u :: U, d :: L, e :: E ?

≡ (r :: B*, u :: U1, d :: L, e :: E ?) | (r :: B*, u :: U2, d :: L, e :: E ?)

≡ (r :: B*1 , r :: B*

2 , u :: U1, d :: L, e :: E ?) | (r :: B*1 , r :: B*

2 , u :: U2, d :: L, e :: E ?)

← B1 | B2

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 32 / 57

Page 58: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Complexity of ShEx0

TheoremContainment for ShEx0 is in EXP

I Covering is a relation of exponential size

I Covering can be obtained with an iterative refinement process(starting with maximal relation and remove at least one element at each iteration until stabilization)

I At each step unfoldings are constructed and each unfolding is a tree whose size is bounded exponentially

TheoremContainment for ShEx0 is EXP-complete

I Reduction from containment for binary tree automata

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 33 / 57

Page 59: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Quick recap

I Containment for ShEx is decidable

I There is a (arguably practical) class DetShEx-0 with tractable containment

I ShEx is very different from tree automata and requires novel techniques

ShEx DetShEx ShEx0 DetShEx0 DetShEx-0

coNEXP-h and co2EXPNP co2EXP EXP-c coNP-h PTIME

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 34 / 57

Page 60: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of ShEx

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 35 / 57

Page 61: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Constructing ShEx from an RDF Graph

What for?

I RDF originally schema-free but schemas are useful

I Free-range RDF is dirty

I Industrial-strength RDF is relatively clean and exhibits regular structure

Relational databaseR2RDF−−−−→ RDF

What do we want?For a given RDF Graph G construct a ShEx schema S that captures the structure of G :

Soundness G satisfies S

Succinctness S is small enough for a human user to consume

Specificity S is not overly general (as the universal schema is)

Implicit node similarity

Two nodes should have the same type if they are similar in some way. Two possible criteria:

I content – the outbound neighborhood is similar

I context – the inbound neighborhood is similar

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 36 / 57

Page 62: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Constructing ShEx from an RDF Graph

What for?

I RDF originally schema-free but schemas are useful

I Free-range RDF is dirty

I Industrial-strength RDF is relatively clean and exhibits regular structure

Relational databaseR2RDF−−−−→ RDF

What do we want?For a given RDF Graph G construct a ShEx schema S that captures the structure of G :

Soundness G satisfies S

Succinctness S is small enough for a human user to consume

Specificity S is not overly general (as the universal schema is)

Implicit node similarity

Two nodes should have the same type if they are similar in some way. Two possible criteria:

I content – the outbound neighborhood is similar

I context – the inbound neighborhood is similar

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 36 / 57

Page 63: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference and Fitting

Fitting

For a input G construct ⊆-minimal schema S that G satisfies.

InferenceAn inference algorithm A is an algorithm that is

I sound i.e., it returns a schema that the input graph satisfies;

I complete i.e., it can return any goal schema provided that the input graph is sufficiently informative (or richenough).

Both approaches are parameterized by a class C of goal ShEx schemas.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 37 / 57

Page 64: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bad news for Fitting: Overfitting

Fitting for ShEx0 is trivial and potentially too verbose

An RDF graph interpreted as a shape graph is its own ShEx0 fitting

bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

Bisimulation can identify redundant/repetitive information

But it won’t generalize the result (it won’t introduce * or recursion).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 38 / 57

Page 65: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bad news for Fitting: Overfitting

Fitting for ShEx0 is trivial and potentially too verbose

An RDF graph interpreted as a shape graph is its own ShEx0 fitting

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

Bisimulation can identify redundant/repetitive information

But it won’t generalize the result (it won’t introduce * or recursion).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 38 / 57

Page 66: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bad news for Fitting: Overfitting

Fitting for ShEx0 is trivial and potentially too verbose

An RDF graph interpreted as a shape graph is its own ShEx0 fitting

string

name

name

email

name

email

descr de

scr

descr

descr

relatedrelated

reportedBy

reproducedBy

related r

eportedBy

reportedBy

reportedBy

Bisimulation can identify redundant/repetitive information

But it won’t generalize the result (it won’t introduce * or recursion).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 38 / 57

Page 67: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bad news for Fitting: Overfitting

Fitting for ShEx0 is trivial and potentially too verbose

An RDF graph interpreted as a shape graph is its own ShEx0 fitting

string

name

name

email

name

email

descr de

scr

descr

descr

relatedrelated

reportedBy

reproducedBy

related r

eportedBy

reportedByreportedBy

reportedBy

Bisimulation can identify redundant/repetitive information

But it won’t generalize the result (it won’t introduce * or recursion).

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 38 / 57

Page 68: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Bad news for Inference: Limit Point

Limit point

A class of languages C has the limit point property iff C contains an ascending chain of languages L1 ( L2 ( . . .whose limit point L∞ =

⋃i Li also belongs to C.

Folklore result: Limit point precludes inference

No family with limit point property has an inference algorithm.

L0:a

L1:

a?

aL2:

a? a?

a. . . L∞: a?

LemmaFor any M ⊆ {1, ?, *, +} with at least two elements the class ShEx0(M) has the limit point property.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 39 / 57

Page 69: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

So. . . What do we do?

We compromise

I Find a suitable subclass that allows inference and remains relatively practical.

I Motivation: first draft of a ShEx schema for an architect to work on.

Our results

ShEx0 SingShEx0 DetShEx0 Typed graphFitting trivial undefined (?) exponential NP-hard

Inference unfeasible SingShEx0(1, *) DetShEx−0 ShEx0 (?)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 40 / 57

Page 70: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

So. . . What do we do?

We compromise

I Find a suitable subclass that allows inference and remains relatively practical.

I Motivation: first draft of a ShEx schema for an architect to work on.

Our results

ShEx0 SingShEx0 DetShEx0 Typed graphFitting trivial undefined (?) exponential NP-hard

Inference unfeasible SingShEx0(1, *) DetShEx−0 ShEx0 (?)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 40 / 57

Page 71: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Embeddings (recall)

I Generalize simulations

I Capture semantics of ShEx by means of structural comparison

bug1bug3

user1 emp1

“Boom!”“Kabang!” “John” “Mary” “[email protected]

name

name

emailrelated

reportedBy

reproducedBy

descr

reportedBy

descr

Bug

User Employee

string

related

*

reportedBy

1

reproducedBy

?

descr

1

name

1email?

name

1email

1

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 41 / 57

Page 72: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Singular Shape Expression Schemas

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 42 / 57

Page 73: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Generalization

G0 bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 74: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Generalization

G0

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 75: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Generalization

G0

string

name

name

email

name

email

descr de

scr

descr

descr

relatedrelated

reportedBy

reproducedBy

related r

eportedBy

reportedBy

reportedBy

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 76: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Generalization

G?

string

name*

name

*email

*

name

*

email

*

related*related

*

reportedBy

*

reproducedBy

*descr

*

related

*

reportedBy

*descr

*

reportedBy

*descr*

reportedBy

*descr

*

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 77: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction

G?

string

name*

name

*email

*

name

*

email

*

related*related

*

reportedBy

*

reproducedBy

*descr

*

related

*

reportedBy

*descr

*

reportedBy

*descr*

reportedBy

*descr

*

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 78: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction

G?

string

name*

name

*email

*

name

*

email

*

related*related

*

reportedBy

*

reproducedBy

*descr

*

related

*

reportedBy

*descr

*

reportedBy

*descr*

reportedBy

*descr

*

G≺

string

name

*

email

*

related

*

reportedBy

*

reproducedBy

*

descr

*

1. Put * on every edge of the input graph

2. Construct the autoembedding

3. Remove any dominated nodes

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 43 / 57

Page 79: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

G?

string

name*

name

*

email

*

related*related

*

reportedBy

*reportedBy

*

reproducedBy

*descr

*

descr*

reportedBy

*descr

*

Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 80: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

string

name*

name

*

email

*

related*related

*

reportedBy

*

reportedBy*

reproducedBy

*descr

*

descr*

reportedBy

*descr

*

Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 81: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

stringname

*

email

*

related*related

*

reportedBy

*

reportedBy*

reproducedBy

*descr

*

descr*

reportedBy

*descr

*

Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 82: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

stringname

*

email

*

related*related*

reportedBy

*

reportedBy*

reproducedBy

*descr

*

descr*

reportedBy

*descr

*

Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 83: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

string

name

*

email

*

related*

reportedBy

*

reproducedBy

*

descr

*

reportedBy

*

descr

*Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 84: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

string

name

*

email

*

related

*

reportedBy

*

reproducedBy

*

descr

*

reportedBy

*

descr

*Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 85: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Reduction in detail

Reduce(G?)

string

name

*

email

*

related

*

reportedBy

*

reproducedBy

*

descr

*

Removing a node n dominated by m

I remove n and all of its outgoing edges

I redirect its incoming edges to m

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 44 / 57

Page 86: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Specialization

G0 bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

string

name

*

email

*

related

*

reportedBy

*

reproducedBy

*

descr

*

I Construct the embedding of the input graph G into the reduct of G *

I Use the embedding to replace *’s with fitter multiplicities

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 45 / 57

Page 87: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Specialization

G0 bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

string

name

*

email

*

related

*

reportedBy

*

reproducedBy

*

descr

*

I Construct the embedding of the input graph G into the reduct of G *

I Use the embedding to replace *’s with fitter multiplicities

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 45 / 57

Page 88: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Specialization

G0 bug1 bug2

bug3 bug4

user1

user2

emp1

“Boom!” “Kaboom!”

“Kabang!” “Bang!”

“John”“Mary” “[email protected]

“Steve”“[email protected]

name

name

email

name

email

relatedrelated

reportedBy

reproducedBy

descr

related r

eportedBy

descr

reportedBy

descr

reportedBy

descr

string

name

1

email

?

related

*

reportedBy

1

reproducedBy

?

descr

1

I Construct the embedding of the input graph G into the reduct of G *

I Use the embedding to replace *’s with fitter multiplicities

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 45 / 57

Page 89: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Inference of SingShEx0: Specialization

G◦

b

u&e

string

name

1

email

?

related

*

reportedBy

1

reproducedBy

?

descr

1

I Construct the embedding of the input graph G into the reduct of G *

I Use the embedding to replace *’s with fitter multiplicities

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 45 / 57

Page 90: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Singular Shape Expression Schemas

Definition (SingShEx0)

A shape graph H is singular if

1. H? is reduced i.e., there are no two types t1 and t2 in H? that t1 can be embedded into t2,

2. H? has no two edges with the same label and the same source and target nodes.

Singularity as a restriction is

I stronger than forbidding any two types t1 and t2 such that t1 ⊆ t2.

I weaker than forbidding any two types with comparable signatures (sets of outgoing edge labels).

t1

a b

t2

a

Not singular

s1

a

b

s2

a

c

Singular

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 46 / 57

Page 91: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Deterministic Shape Expression Schemas

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 47 / 57

Page 92: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Deterministic Shape Expression Schemas

Definition (DetShEx0)

A shape graph H is deterministic if for every label a every node has at most one outgoing edges labeled with a.

n

n1 n2 n3

a

a b

b

c

c d

d

Not deterministic

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 48 / 57

Page 93: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Determinizing Shape Graphs

n

n1 n2 n3

aa b

b

c

c d

d

Det

n

{n1, n2} {n2, n3}

a*

b*

c

d?

d

c?

Determinization of a shape graph

I gives the fitting of the input graph(the ⊆-minimal DetShEx0 schema that validates the input graph)

I might produce exponential output

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 49 / 57

Page 94: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Quotient Determinization of Shape Graphs

Avoid explosion by refusing to duplicate types

I if n1 is fused with n2 and n2 is fused with n3, then fuse n1, n2, and n3 together.

I produces linear output

n

n1 n2 n3

a

a b

b

c

c dd

Det◦

n

{n1, n2, n3}

a * b*

c ? d?

Quotient determinization is an inference algorithm for DetShEx−0

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 50 / 57

Page 95: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Example

bug1

bug3 bug4

emp1 user1

string

nameemail

name

descr

descr

descr

relatedrelated

reportedBy

reportedBy

reportedBy

reproducedBy

Det◦

bug1

{bug3,bug4}

{user1,emp1}

string

name

email

?

descr

reproducedBy

?reportedBy

descr

related

*

reportedBy

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 51 / 57

Page 96: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Conclusions and Future Work

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 52 / 57

Page 97: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Summary

What is ShEx?

I Automata-like formalism for graphs but unlike any automata we have seen.

I Strong connections to graph bisimulation.

Complexity of ValidationShEx0 ShEx DetShEx (SORBE)

multi-type PTIME NP-complete PTIMEsingle-type NP-complete

Complexity of Containment

ShEx DetShEx(SORBE) ShEx0 DetShEx0 DetShEx-0

coNEXP-h and co2EXPNP co2EXP EXP-c coNP-h PTIME

Inference and Fitting

ShEx0 SingShEx0 DetShEx0 Typed graphFitting trivial undefined (?) exponential NP-hard

Inference unfeasible SingShEx0(1, *) DetShEx−0 ShEx0 (?)

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 53 / 57

Page 98: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Future work: Repairing RDF

Bug

User Employee

string

reportedBy

1

reproducedBy

?

name

1

email?

name

1email

1

b1

p1

“Mary”

name

reproducedBy

Repair

Insert

1

b1

p1

“Mary”

name

name

email

reportedBy

reproducedBy

Modify

2

b1

p1

“Mary”

name

reportedBy

Delete

3

b1

p1

“Mary”

name

reproducedBy

What is thy bidding, My master?

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 54 / 57

Page 99: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Questions

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 55 / 57

Page 100: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Appendix

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 56 / 57

Page 101: Shape Expression Schemas for RDFresearchers.lille.inria.fr/~staworko/papers/staworko-alga-19-talk.pdf · S lawek S. (ULille & INRIA LINKS)ShEx for RDFGT-ALGA, Paris 20191/57. What

Formal Definition of Grammatical Inference of Shape Expression Schemas

DefinitionA class of shape graphs C is learnable in polynomial time and data from a class of graphs G iff there exists apolynomial inference algorithm A such that the following two conditions are satisfied:

Soundness For every input graph G ∈ G the inference algorithm returns a graph schema A(G) = H suchthat H ∈ C and G ∈ L(H).

Completeness For every graph schema H ∈ C there exists a polynomially-sized characteristic graph GH ∈ L(H)such that for any G that extends GH consistently with H we have A(G ′) ≡ H.

When does G extend G ′?3 alternative definitions

1. G is a disjoint union of G ′ and some G ′′

2. G is obtained from G ′ by adding new nodes and new edges

3. as 2. but no node of G may loose a type as a result of adding outgoing edges to it.

S lawek S. (ULille & INRIA LINKS) ShEx for RDF GT-ALGA, Paris 2019 57 / 57