The Web of Data: do we actually understand what we built?
TRANSCRIPT
Creative Commons CC BY 3.0: allowed to share & remix (also commercially), but must attribute
Frank van Harmelen
(pssst: our theory has fallen way behind our technology, our “why” is much weaker than our “how”)
Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we were talking about, it wouldn’t be called research
Intro: philosophical stance
Computer Science should be like a natural science: studying objects in the information universe, and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
Computer Science = telescope science?
"Computer science is no more about computers than astronomy is about telescopes”
-- Edsger W. Dijkstra
“The computer is not our object of study, it’s our observational instrument”
Methodological Manifesto
Computer Science often goes from a (desired) property to a (designed) object.
In this talk: given a (very large & complex) object, what are its (observed) properties?
Not: “solving a problem”
But: “answering a question”
Our object of study & what to measure
Semantic Web in 5 principles
1. Give all things a name
2. Make a graph of relations between the things
   (at this point we have (only) a Giant Graph)
3. Make sure all names are URIs
   (at this point we have (only) a Giant Global Graph)
4. Add semantics (= predictable inference)
Examples of “semantics”
Semantics = predictable inference
Frank --married-to--> Lynda
Frank --married-to--> Hazel
• Frank is male
• married-to relates males to females (lower bound)
• married-to relates 1 male to 1 female (upper bound)
⇒ Lynda = Hazel
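The inference on this slide can be sketched in a few lines. This is a toy illustration of the functional-property rule applied by hand, not the talk's actual code; the data is taken from the slide.

```python
# Toy sketch of "predictable inference": the functional-property rule,
# applied by hand. Data and names come from the slide.
triples = {
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
}
functional = {"married-to"}   # upper bound: relates 1 male to 1 female

def infer_same_as(triples, functional):
    """If p is functional and (s,p,o1) and (s,p,o2) both hold, infer o1 = o2."""
    same = set()
    for (s1, p1, o1) in triples:
        for (s2, p2, o2) in triples:
            if p1 == p2 and p1 in functional and s1 == s2 and o1 != o2:
                same.add(tuple(sorted((o1, o2))))
    return same

print(infer_same_as(triples, functional))  # {('Hazel', 'Lynda')}
```

This is exactly the predictable part: any reasoner given these axioms must draw the same conclusion.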
Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDFa)
• data.gov = data-publishing
Also: Oracle DB, IBM DB2; Reuters, New York Times, Guardian; Sears, Kmart, OverStock, Volkswagen, Renault; GoodRelations ontology, schema.org; Yahoo, Bing
How big is the Semantic Web?
~10^10 triples @ 1 triple/golfball
≈ 1 triple per web-page
[figure: scale comparison, “Jupiter”]
(slide: Denny Vrandečić, AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS)
Observing at different scales
Distances weighted by number of links
What is this picture telling us?
• single connected component
• dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution is highly skewed
• its structure varies between aggregation levels
What is this picture telling us?
• Does the meaning of a node depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of low-level and high-level structure?
Logic?
Measuring what?
• degree distribution: P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of the neighbourhood;
  an intuitive notion of connectivity, but only local
• betweenness centrality: fraction of all shortest paths that pass through a node;
  how essential the node is for connectivity, likelihood of being visited on a graph walk
• closeness centrality: 1 / average distance to all other nodes;
  where to start a graph walk
• average shortest path length: helps to tune an upper bound on graph walks
• number of (strongly) connected components
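Two of the cheaper measures in this list can be sketched directly. The graph below is invented for illustration; real WoD graphs are, of course, many orders of magnitude larger.

```python
from collections import deque, Counter

# Toy directed graph (hypothetical, for illustration only).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, [])

def degree_distribution(adj):
    """P(d(v)=n) as a Counter over out-degrees."""
    return Counter(len(vs) for vs in adj.values())

def closeness(adj, s):
    """1 / average BFS distance from s to the nodes it can reach."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    reached = [d for v, d in dist.items() if v != s]
    return len(reached) / sum(reached) if reached else 0.0

print(degree_distribution(adj))   # Counter({1: 3, 2: 1, 0: 1})
print(closeness(adj, "a"))        # 0.4
```

Betweenness centrality needs all-pairs shortest paths and is correspondingly harder, which is exactly the trade-off the slide hints at.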
Measuring when? 2009 vs 2014
Real phenomenon or measurement artefact?
Some first measurements & their difficulties
Christophe Gueret
OK, let’s measure
• Billion Triple Challenge 2009
• WoD 2009
• WoD 2010
• BTC aggregated
• BTC aggregated & intersected
• sameAs aggregate
Non-trivial decisions
OK, let’s measure
Degree distribution: BTC vs. BTC aggregated
This suggests a power-law distribution at different scales
OK, let’s measure
• Comparing WoD 2009 & 2010: increasing power-law behaviour.
• Top 5 by degree centrality in the sameAs-aggregated graph. Preferential attachment?
Dataset               sameAs degree centrality
Revyu.com             0.039
Semanticweb.org       0.037
Dbpedia.org           0.027
Data.semanticweb.org  0.019
www.deri.ie           0.017
One person owns 4 out of these 5 domains. Interesting socio-technical questions!
But what should we measure?
• Treat sameAs nodes as a single node? (semantically yes, pragmatically no?)
• Is connectedness meaningful, instead of strongly connected? (semantically no, pragmatically yes?)
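One way to operationalise "treat sameAs nodes as a single node" is to collapse sameAs-connected URIs into equivalence classes with a union-find structure. A minimal sketch, with made-up URIs:

```python
# Sketch: collapse owl:sameAs-linked URIs into one canonical node (union-find).
# The URIs below are invented for illustration.
same_as = [("dbpedia:Amsterdam", "freebase:Amsterdam"),
           ("freebase:Amsterdam", "geonames:2759794")]

parent = {}

def find(x):
    """Return the canonical representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in same_as:
    union(a, b)

# All three URIs now map to a single canonical representative.
reps = {find(x) for pair in same_as for x in pair}
print(len(reps))  # 1
```

Whether the collapsed or the uncollapsed graph is the "right" object to measure is exactly the open question on this slide.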
???????
And what are “good” values?
• Degree distribution should be a power law? (robust against random decay)
• Local clustering coefficient should be high? (strongly connected “topics”)
• Betweenness impact of a sameAs-link should be high? (adds much extra information)
???????
How to build a WoD observatory?
Wouter Beek
LOD Laundromat: clean your dirty triples
• crawl from registries (CKAN), by chasing URLs; users can submit URLs
• read multiple formats
• clean syntax errors
• remove duplicates
• compute meta-data information
• publish triples as a JSON API & meta-data as SPARQL
• harvest 1B triples/day
LOD Laundromat:
• 600,000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals (not counting 6,699,148,542 integers, dates, etc.)
• 328 GB of zipped RDF
• 1.2% of docs use owl:sameAs; 1.5% use OWL at all
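The "clean syntax errors" and "remove duplicates" steps can be sketched as a toy filter. The regex below is a deliberate simplification for illustration, not a real N-Triples parser, and the data is invented:

```python
import re

# Toy "laundromat" pass: keep only well-formed, previously unseen triples.
dirty = [
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',   # duplicate
    'this line is not a triple at all',                          # syntax error
    '<http://ex.org/a> <http://ex.org/p> "literal" .',
]

# Simplified triple pattern: <uri> <uri> (<uri> | "literal") .
TRIPLE = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(<[^>]+>|"[^"]*")\s*\.$')

clean, seen = [], set()
for line in dirty:
    m = TRIPLE.match(line.strip())
    if m and m.groups() not in seen:    # drop malformed and duplicate triples
        seen.add(m.groups())
        clean.append(m.groups())

print(len(clean))  # 2 unique well-formed triples survive
```

The real pipeline additionally normalises serialisation formats and publishes the provenance meta-data, which this sketch omits.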
From LOD Laundry to WoD observatory
• What facilities are needed?
• Centralised or distributed?
• Disagree mostly with Web Science Journal paper
???????
Graph structure as a proxy for semantics
Laurens Rietveld
Hotspots in Knowledge Graphs
• Observation: realistic queries only hit a small part of the data (< 2%)
• DBPedia would need 500k queries to hit < 1%
• Non-trivial to obtain these numbers
Dataset Size #queries Coverage
DBPedia 3.9 459M 1640 0.003%
Linked Geo Data 289M 81 1.917%
MetaLex 204M 4933 0.016%
Open-BioMed 79M 931 3.100%
Bio2RDF/KEGG 50M 1297 2.013%
SW Dog Food 240K 193 39.438%
Can graph-structure help us here?
• can we find the popular part of the graph without knowing the queries?
For a subgraph G' ⊆ G, find the relevance function F(G',G): the probability that, for a realistic query Q, an answer a ∈ Q(G) is also an answer a ∈ Q(G').
Can we approximate F by simple structural properties?
= Can we use structure to predict semantic importance?
Experiment
• Use graph-measures as selection thresholds:
  – indegree (easy)
  – outdegree (easy)
  – pagerank (doable, iterative)
  – betweenness centrality (hard)
Evaluate against queries.

Evaluation: example dataset
Subject Predicate Object Weight
:Laurens :bornIn :Amsterdam 0.6
:Amsterdam :capitalOf :NL 0.1
:Stefan :bornIn :Berlin 0.9
:Berlin :capitalOf :Germany 0.5
:Rinke :bornIn :Heerenveen 0.1
Triples from query:
Subject Predicate Object
:Laurens :bornIn :Amsterdam
:Amsterdam :capitalOf :NL
:Stefan :bornIn :Berlin
:Berlin :capitalOf :Germany
Which answers would we get with a sample of 60%?
Query:
SELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country }
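On this toy dataset the 60%-sample question can be answered mechanically: keep the top-weighted 60% of the triples and re-run the join. A sketch (the join stands in for the SPARQL query; weights are the ones from the slide):

```python
# Slide's toy dataset: (subject, predicate, object, structural weight).
triples = [
    ("Laurens",   "bornIn",    "Amsterdam",  0.6),
    ("Amsterdam", "capitalOf", "NL",         0.1),
    ("Stefan",    "bornIn",    "Berlin",     0.9),
    ("Berlin",    "capitalOf", "Germany",    0.5),
    ("Rinke",     "bornIn",    "Heerenveen", 0.1),
]

def answers(ts):
    """?person :bornIn ?city . ?city :capitalOf ?country -> (person, country)"""
    born    = {(s, o) for s, p, o, _ in ts if p == "bornIn"}
    capital = {(s, o) for s, p, o, _ in ts if p == "capitalOf"}
    return {(person, country)
            for person, city in born
            for city2, country in capital if city == city2}

full = answers(triples)
top = sorted(triples, key=lambda t: t[3], reverse=True)[:3]   # top 60% by weight
sampled = answers(top)
print(full)     # {('Laurens', 'NL'), ('Stefan', 'Germany')}
print(sampled)  # {('Stefan', 'Germany')}
```

The 60% sample keeps one of the two answers: the low-weight `:Amsterdam :capitalOf :NL` triple is dropped, so Laurens' answer is lost while Stefan's survives.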
Structural sampling: results
Why does this work so unreasonably well?
Which methods work on which types of graphs?
Logic?
Exploiting the graph structure (inconsistency)
Zhisheng Huang
General Idea
Answer a query φ over an increasing sequence of selected subsets of the TBox T:
s(T,φ,0) ⊆ s(T,φ,1) ⊆ s(T,φ,2) ⊆ …
Which selection function s(T,φ,n)? V1: symbol-distance
• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a symbol with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a symbol with the definition of a concept in S(T,φ,n-1)
Which selection function s(T,φ,n)? V2: concept-distance
• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a concept with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a concept with the definition of a concept in S(T,φ,n-1)
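Both variants follow the same expansion loop. A sketch of the symbol-distance version on an invented toy TBox (the concept names and symbol sets are hypothetical, loosely inspired by the MadCow example later in the talk):

```python
# Toy TBox: concept -> symbols occurring in its definition (invented data).
tbox = {
    "MadCow": {"Cow", "BrainOfSheep"},
    "Cow":    {"Animal", "Vegetarian"},
    "Sheep":  {"Animal"},
    "Car":    {"Vehicle"},
}

def symbols(tbox, c):
    """All symbols of the axiom defining c, including c itself."""
    return {c} | tbox.get(c, set())

def select(tbox, phi, n):
    """S(T, phi, 0) = {phi}; each step adds every concept whose definition
    shares a symbol with something already selected (V1: symbol-distance)."""
    selected = {phi}
    for _ in range(n):
        frontier = set().union(*(symbols(tbox, c) for c in selected))
        selected |= {c for c in tbox if symbols(tbox, c) & frontier}
    return selected

print(select(tbox, "MadCow", 1))  # {'MadCow', 'Cow'}
print(select(tbox, "MadCow", 2))  # adds 'Sheep' (via shared symbol 'Animal')
```

The irrelevant concept `Car` is never pulled in, which is the point: the reasoner only ever sees a small, query-relevant neighbourhood of the graph.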
Evaluation
• “Graph-growing” gave a high-quality, sound approximation
Ontology #queries Unexpected Intended
MadCow+ 2594 0 93%
Communication 6576 0 96%
Transportation 6258 0 99%
Why does this work so unreasonably well?
Which selection function s(T,φ,n)? V3: Google distance

NGD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log M - min{log f(x), log f(y)})

where
  f(x)   is the number of Google hits for x
  f(x,y) is the number of Google hits for the tuple of search terms x and y
  M      is the number of web pages indexed by Google
≈ symmetric conditional probability of co-occurrence
≈ estimate of semantic distance
Google distance (NGD)
[figure: concept clustering by NGD, with terms: animal, sheep, cow, mad cow, vegetarian, plant]
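Computing NGD itself is a one-liner once the hit counts are known. A sketch; the hit counts and index size below are invented numbers, not real Google statistics:

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google Distance from hit counts fx=f(x), fy=f(y), fxy=f(x,y)
    and index size M. All inputs in the examples below are hypothetical."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# Terms that always co-occur are at distance 0 ...
print(ngd(1000, 1000, 1000, 10**12))        # 0.0
# ... and the rarer the co-occurrence, the larger the distance.
print(ngd(10**6, 10**6, 10, 10**12))        # larger
print(ngd(10**6, 10**6, 10**5, 10**12))     # smaller
```

Applying this to the labels behind URIs is what gives the "semantic" selection function V3.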
Google distance
This isn’t supposed to work!
URIs are supposed to be meaningless…
Information content of semantic graphs
Steven de Rooij
Compressibility as an information measure
• 11111111111111111111 = 20x1
• 11111000001111100000 = 5x1;5x0;5x1;5x0
• 10110001011101010010 = ?
• random = incompressible

Depends on the target language:
• 11111000001111100000 = 2x(5x1;5x0)
• 3.141592653589793238 = the first digits of π
Do URIs encode meaning?
• Below the horizontal line: URIs encode explicit class info
• Left of the vertical line: URIs encode implicit class info
(BTW, this is 600,000 data points: one per RDF doc)
We need a semantics that accounts for this!
Decompressibility as an information measure
Nobody can predict these numbers
Exploiting the graph structure (inference)
Kathrin Dentler
Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Example rules: “Every author of a paper is a person”; “Every person is also an agent”
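A deterministic toy sketch of the idea, using the slide's two example rules on invented facts. The real system walks the graph randomly with a swarm of independent reasoners; here each reasoner simply scans all facts, which keeps the sketch reproducible:

```python
# Toy swarm reasoning: each micro-reasoner carries exactly one rule;
# derived facts get a time-to-live and decay unless re-derived.
facts = {("paper1", "author", "frank"): None}   # None = ground fact, never expires
derived_ever = set()
TTL = 3

rules = [
    # "every author of a paper is a person"
    lambda s, p, o: (o, "type", "Person") if p == "author" else None,
    # "every person is also an agent"
    lambda s, p, o: (s, "type", "Agent") if (p, o) == ("type", "Person") else None,
]

for step in range(5):
    new = []
    for rule in rules:                       # one rule per micro-reasoner
        for (s, p, o) in list(facts):
            d = rule(s, p, o)
            if d:
                new.append(d)
    for d in new:
        facts[d] = TTL                       # (re)derived facts get a fresh TTL
        derived_ever.add(d)
    for f, ttl in list(facts.items()):       # deduced facts decay over time
        if ttl is not None:
            facts[f] -= 1
            if facts[f] <= 0:
                del facts[f]

print(sorted(derived_ever))
```

As long as a fact's premises stay in the graph it keeps being re-derived; once they disappear, it decays. This is exactly the trade: determinism and completeness are lost, anytime behaviour is gained.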
Some early results
• most of the derivations are produced
• Lost: determinism, completeness
• Gained: anytime, coherent, prioritised
For which graphs does this work well or not?
Closing: a call to all Semantic Web researchers
A gazillion new open questions
don’t just try to build things, also try to understand things
don’t just ask how, also ask why