The Web of Data: do we actually understand what we built?
TRANSCRIPT
Creative Commons CC BY 3.0: allowed to share & remix (also commercially), but must attribute
Frank van Harmelen
(pssst: our theory has fallen way behind our technology, our “why” is much weaker than our “how”)
Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we were talking about, it wouldn’t be called research
Intro: philosophical stance
Computer Science should be like a natural science: studying objects in the information universe, and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
Computer Science = telescope science?
"Computer science is no more about computers than astronomy is about telescopes”
-- Edsger W. Dijkstra
“The computer is not our object of study, it’s our observational instrument”
Methodological Manifesto
Computer Science often goes from a (desired) property to a (designed) object.
In this talk: given a (very large & complex) object, what are its (observed) properties?
Not: “solving a problem”
But: “answering a question”
Our object of study & what to measure
Semantic Web in 5 principles
1. Give all things a name
2. Make a graph of relations between the things
   (at this point we have (only) a Giant Graph)
3. Make sure all names are URIs
   (at this point we have (only) a Giant Global Graph)
4. Add semantics (= predictable inference)
Examples of “semantics”
Semantics = predictable inference
Frank --married-to--> Lynda
Frank --married-to--> Hazel
• Frank is male
• married-to relates males to females (lower bound)
• married-to relates 1 male to 1 female (upper bound)
⇒ Lynda = Hazel
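The inference on this slide can be sketched in a few lines. This is a toy illustration of the functional-property rule applied by hand, not the talk's actual code; the data is taken from the slide.

```python
# Toy sketch of "predictable inference": the functional-property rule,
# applied by hand. Data and names come from the slide.
triples = {
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
}
functional = {"married-to"}   # upper bound: relates 1 male to 1 female

def infer_same_as(triples, functional):
    """If p is functional and (s,p,o1) and (s,p,o2) both hold, infer o1 = o2."""
    same = set()
    for (s1, p1, o1) in triples:
        for (s2, p2, o2) in triples:
            if p1 == p2 and p1 in functional and s1 == s2 and o1 != o2:
                same.add(tuple(sorted((o1, o2))))
    return same

print(infer_same_as(triples, functional))  # {('Hazel', 'Lynda')}
```

This is exactly the predictable part: any reasoner given these axioms must draw the same conclusion.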
Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDFa)
• data.gov = data-publishing
Also: Oracle DB, IBM DB2; Reuters, New York Times, Guardian; Sears, Kmart, OverStock, Volkswagen, Renault; GoodRelations ontology, schema.org; Yahoo, Bing
How big is the Semantic Web?
~10^10 triples @ 1 triple/golfball
≈ 1 triple per web-page
[figure: scale comparison, “Jupiter”]
(slide: Denny Vrandečić, AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS)
Observing at different scales
Distances weighted by number of links
What is this picture telling us?
• single connected component
• dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution is highly skewed
• its structure varies between aggregation levels
What is this picture telling us?
• Does the meaning of a node depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of low-level and high-level structure?
Logic?
Measuring what?
• degree distribution: P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of the neighbourhood;
  an intuitive notion of connectivity, but only local
• betweenness centrality: fraction of all shortest paths that pass through a node;
  how essential the node is for connectivity, likelihood of being visited on a graph walk
• closeness centrality: 1 / average distance to all other nodes;
  where to start a graph walk
• average shortest path length: helps to tune an upper bound on graph walks
• number of (strongly) connected components
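Two of the cheaper measures in this list can be sketched directly. The graph below is invented for illustration; real WoD graphs are, of course, many orders of magnitude larger.

```python
from collections import deque, Counter

# Toy directed graph (hypothetical, for illustration only).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, [])

def degree_distribution(adj):
    """P(d(v)=n) as a Counter over out-degrees."""
    return Counter(len(vs) for vs in adj.values())

def closeness(adj, s):
    """1 / average BFS distance from s to the nodes it can reach."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    reached = [d for v, d in dist.items() if v != s]
    return len(reached) / sum(reached) if reached else 0.0

print(degree_distribution(adj))   # Counter({1: 3, 2: 1, 0: 1})
print(closeness(adj, "a"))        # 0.4
```

Betweenness centrality needs all-pairs shortest paths and is correspondingly harder, which is exactly the trade-off the slide hints at.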
Measuring when? 2009 vs 2014
Real phenomenon or measurement artefact?
Some first measurements & their difficulties
Christophe Gueret
OK, let’s measure
• Billion Triple Challenge 2009
• WoD 2009
• WoD 2010
• BTC aggregated
• BTC aggregated & intersected
• sameAs aggregate
Non-trivial decisions
OK, let’s measure
Degree distribution: BTC vs. BTC aggregated
This suggests a power-law distribution at different scales
OK, let’s measure
• Comparing WoD 2009 & 2010: increasing power-law behaviour.
• Top 5 by degree centrality in the sameAs-aggregated graph. Preferential attachment?
Dataset               sameAs degree centrality
Revyu.com             0.039
Semanticweb.org       0.037
Dbpedia.org           0.027
Data.semanticweb.org  0.019
www.deri.ie           0.017
One person owns 4 out of these 5 domains. Interesting socio-technical questions!
But what should we measure?
• Treat sameAs nodes as a single node? (semantically yes, pragmatically no?)
• Is connectedness meaningful, instead of strongly connected? (semantically no, pragmatically yes?)
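One way to operationalise "treat sameAs nodes as a single node" is to collapse sameAs-connected URIs into equivalence classes with a union-find structure. A minimal sketch, with made-up URIs:

```python
# Sketch: collapse owl:sameAs-linked URIs into one canonical node (union-find).
# The URIs below are invented for illustration.
same_as = [("dbpedia:Amsterdam", "freebase:Amsterdam"),
           ("freebase:Amsterdam", "geonames:2759794")]

parent = {}

def find(x):
    """Return the canonical representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in same_as:
    union(a, b)

# All three URIs now map to a single canonical representative.
reps = {find(x) for pair in same_as for x in pair}
print(len(reps))  # 1
```

Whether the collapsed or the uncollapsed graph is the "right" object to measure is exactly the open question on this slide.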
???????
And what are “good” values?
• Degree distribution should be a power law? (robust against random decay)
• Local clustering coefficient should be high? (strongly connected “topics”)
• Betweenness impact of a sameAs-link should be high? (adds much extra information)
???????
How to build a WoD observatory?
Wouter Beek
LOD Laundromat: clean your dirty triples
• crawl from registries (CKAN), by chasing URLs; users can submit URLs
• read multiple formats
• clean syntax errors
• remove duplicates
• compute meta-data information
• publish triples as a JSON API & meta-data as SPARQL
• harvest 1B triples/day
LOD Laundromat:
• 600,000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals (not counting 6,699,148,542 integers, dates, etc.)
• 328 GB of zipped RDF
• 1.2% of docs use owl:sameAs; 1.5% use OWL at all
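The "clean syntax errors" and "remove duplicates" steps can be sketched as a toy filter. The regex below is a deliberate simplification for illustration, not a real N-Triples parser, and the data is invented:

```python
import re

# Toy "laundromat" pass: keep only well-formed, previously unseen triples.
dirty = [
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',   # duplicate
    'this line is not a triple at all',                          # syntax error
    '<http://ex.org/a> <http://ex.org/p> "literal" .',
]

# Simplified triple pattern: <uri> <uri> (<uri> | "literal") .
TRIPLE = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(<[^>]+>|"[^"]*")\s*\.$')

clean, seen = [], set()
for line in dirty:
    m = TRIPLE.match(line.strip())
    if m and m.groups() not in seen:    # drop malformed and duplicate triples
        seen.add(m.groups())
        clean.append(m.groups())

print(len(clean))  # 2 unique well-formed triples survive
```

The real pipeline additionally normalises serialisation formats and publishes the provenance meta-data, which this sketch omits.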
From LOD Laundry to WoD observatory
• What facilities are needed?
• Centralised or distributed?
• Disagree mostly with Web Science Journal paper
???????
Graph structure as a proxy for semantics
Laurens Rietveld
Hotspots in Knowledge Graphs
• Observation: realistic queries only hit a small part of the data (< 2%)
• DBPedia would need 500k queries to hit < 1%
• Non-trivial to obtain these numbers
Dataset Size #queries Coverage
DBPedia 3.9 459M 1640 0.003%
Linked Geo Data 289M 81 1.917%
MetaLex 204M 4933 0.016%
Open-BioMed 79M 931 3.100%
Bio2RDF/KEGG 50M 1297 2.013%
SW Dog Food 240K 193 39.438%
Can graph-structure help us here?
• can we find the popular part of the graph without knowing the queries?
For a subgraph G' ⊆ G, find the relevance function F(G',G): the probability that, for a realistic query Q, an answer a ∈ Q(G) is also an answer a ∈ Q(G').
Can we approximate F by simple structural properties?
= Can we use structure to predict semantic importance?
Experiment
• Use graph-measures as selection thresholds:
  – indegree (easy)
  – outdegree (easy)
  – pagerank (doable, iterative)
  – betweenness centrality (hard)
Evaluate against queries.

Evaluation: example dataset
Subject Predicate Object Weight
:Laurens :bornIn :Amsterdam 0.6
:Amsterdam :capitalOf :NL 0.1
:Stefan :bornIn :Berlin 0.9
:Berlin :capitalOf :Germany 0.5
:Rinke :bornIn :Heerenveen 0.1
Triples from query:
Subject Predicate Object
:Laurens :bornIn :Amsterdam
:Amsterdam :capitalOf :NL
:Stefan :bornIn :Berlin
:Berlin :capitalOf :Germany
Which answers would we get with a sample of 60%?
Query:
SELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country }
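On this toy dataset the 60%-sample question can be answered mechanically: keep the top-weighted 60% of the triples and re-run the join. A sketch (the join stands in for the SPARQL query; weights are the ones from the slide):

```python
# Slide's toy dataset: (subject, predicate, object, structural weight).
triples = [
    ("Laurens",   "bornIn",    "Amsterdam",  0.6),
    ("Amsterdam", "capitalOf", "NL",         0.1),
    ("Stefan",    "bornIn",    "Berlin",     0.9),
    ("Berlin",    "capitalOf", "Germany",    0.5),
    ("Rinke",     "bornIn",    "Heerenveen", 0.1),
]

def answers(ts):
    """?person :bornIn ?city . ?city :capitalOf ?country -> (person, country)"""
    born    = {(s, o) for s, p, o, _ in ts if p == "bornIn"}
    capital = {(s, o) for s, p, o, _ in ts if p == "capitalOf"}
    return {(person, country)
            for person, city in born
            for city2, country in capital if city == city2}

full = answers(triples)
top = sorted(triples, key=lambda t: t[3], reverse=True)[:3]   # top 60% by weight
sampled = answers(top)
print(full)     # {('Laurens', 'NL'), ('Stefan', 'Germany')}
print(sampled)  # {('Stefan', 'Germany')}
```

The 60% sample keeps one of the two answers: the low-weight `:Amsterdam :capitalOf :NL` triple is dropped, so Laurens' answer is lost while Stefan's survives.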
Structural sampling: results
Why does this work so unreasonably well?
Which methods work on which types of graphs?
Logic?
Exploiting the graph structure (inconsistency)
Zhisheng Huang
General Idea
Answer a query φ over an increasing sequence of selected subsets of the TBox T:
s(T,φ,0) ⊆ s(T,φ,1) ⊆ s(T,φ,2) ⊆ …
Which selection function s(T,φ,n)? V1: symbol-distance
• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a symbol with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a symbol with the definition of a concept in S(T,φ,n-1)
Which selection function s(T,φ,n)? V2: concept-distance
• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a concept with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a concept with the definition of a concept in S(T,φ,n-1)
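Both variants follow the same expansion loop. A sketch of the symbol-distance version on an invented toy TBox (the concept names and symbol sets are hypothetical, loosely inspired by the MadCow example later in the talk):

```python
# Toy TBox: concept -> symbols occurring in its definition (invented data).
tbox = {
    "MadCow": {"Cow", "BrainOfSheep"},
    "Cow":    {"Animal", "Vegetarian"},
    "Sheep":  {"Animal"},
    "Car":    {"Vehicle"},
}

def symbols(tbox, c):
    """All symbols of the axiom defining c, including c itself."""
    return {c} | tbox.get(c, set())

def select(tbox, phi, n):
    """S(T, phi, 0) = {phi}; each step adds every concept whose definition
    shares a symbol with something already selected (V1: symbol-distance)."""
    selected = {phi}
    for _ in range(n):
        frontier = set().union(*(symbols(tbox, c) for c in selected))
        selected |= {c for c in tbox if symbols(tbox, c) & frontier}
    return selected

print(select(tbox, "MadCow", 1))  # {'MadCow', 'Cow'}
print(select(tbox, "MadCow", 2))  # adds 'Sheep' (via shared symbol 'Animal')
```

The irrelevant concept `Car` is never pulled in, which is the point: the reasoner only ever sees a small, query-relevant neighbourhood of the graph.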
Evaluation
• “Graph-growing” gave a high-quality, sound approximation
Ontology #queries Unexpected Intended
MadCow+ 2594 0 93%
Communication 6576 0 96%
Transportation 6258 0 99%
Why does this work so unreasonably well?
Which selection function s(T,φ,n)? V3: Google distance

NGD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log M - min{log f(x), log f(y)})

where
  f(x)   is the number of Google hits for x
  f(x,y) is the number of Google hits for the tuple of search terms x and y
  M      is the number of web pages indexed by Google
≈ symmetric conditional probability of co-occurrence
≈ estimate of semantic distance
Google distance (NGD)
[figure: concept clustering by NGD, with terms: animal, sheep, cow, mad cow, vegetarian, plant]
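Computing NGD itself is a one-liner once the hit counts are known. A sketch; the hit counts and index size below are invented numbers, not real Google statistics:

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google Distance from hit counts fx=f(x), fy=f(y), fxy=f(x,y)
    and index size M. All inputs in the examples below are hypothetical."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# Terms that always co-occur are at distance 0 ...
print(ngd(1000, 1000, 1000, 10**12))        # 0.0
# ... and the rarer the co-occurrence, the larger the distance.
print(ngd(10**6, 10**6, 10, 10**12))        # larger
print(ngd(10**6, 10**6, 10**5, 10**12))     # smaller
```

Applying this to the labels behind URIs is what gives the "semantic" selection function V3.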
Google distance
This isn’t supposed to work!
URIs are supposed to be meaningless…
Information content of semantic graphs
Steven de Rooij
Compressibility as an information measure
• 11111111111111111111 = 20x1
• 11111000001111100000 = 5x1;5x0;5x1;5x0
• 10110001011101010010 = ?
• random = incompressible

Depends on the target language:
• 11111000001111100000 = 2x(5x1;5x0)
• 3.141592653589793238 = the first digits of π
Do URIs encode meaning?
• Below the horizontal line: URIs encode explicit class info
• Left of the vertical line: URIs encode implicit class info
(BTW, this is 600,000 data points: one per RDF doc)
We need a semantics that accounts for this!
Decompressibility as an information measure
Nobody can predict these numbers
Exploiting the graph structure (inference)
Kathrin Dentler
Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Example rules: “Every author of a paper is a person”; “Every person is also an agent”
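A deterministic toy sketch of the idea, using the slide's two example rules on invented facts. The real system walks the graph randomly with a swarm of independent reasoners; here each reasoner simply scans all facts, which keeps the sketch reproducible:

```python
# Toy swarm reasoning: each micro-reasoner carries exactly one rule;
# derived facts get a time-to-live and decay unless re-derived.
facts = {("paper1", "author", "frank"): None}   # None = ground fact, never expires
derived_ever = set()
TTL = 3

rules = [
    # "every author of a paper is a person"
    lambda s, p, o: (o, "type", "Person") if p == "author" else None,
    # "every person is also an agent"
    lambda s, p, o: (s, "type", "Agent") if (p, o) == ("type", "Person") else None,
]

for step in range(5):
    new = []
    for rule in rules:                       # one rule per micro-reasoner
        for (s, p, o) in list(facts):
            d = rule(s, p, o)
            if d:
                new.append(d)
    for d in new:
        facts[d] = TTL                       # (re)derived facts get a fresh TTL
        derived_ever.add(d)
    for f, ttl in list(facts.items()):       # deduced facts decay over time
        if ttl is not None:
            facts[f] -= 1
            if facts[f] <= 0:
                del facts[f]

print(sorted(derived_ever))
```

As long as a fact's premises stay in the graph it keeps being re-derived; once they disappear, it decays. This is exactly the trade: determinism and completeness are lost, anytime behaviour is gained.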
Some early results
• most of the derivations are produced
• Lost: determinism, completeness
• Gained: anytime, coherent, prioritised
For which graphs does this work well or not?
Closing: a call to all Semantic Web researchers
A gazillion new open questions
don’t just try to build things, also try to understand things
don’t just ask how, also ask why