the web of data: do we actually understand what we built?


Upload: frank-van-harmelen

Post on 13-Aug-2015


TRANSCRIPT

Page 1: The Web of Data: do we actually understand what we built?

Creative Commons CC BY 3.0: allowed to share & remix (also commercial) but must attribute

Frank van Harmelen

The Web of Data: do we actually understand what we built?

(pssst: Our theory has fallen way behind our technology, our “why” is much weaker than our “how”)

Page 2: The Web of Data: do we actually understand what we built?

Some expectation management

• Speculation
• Questions
• Hypotheses

If we knew what we were talking about, it wouldn’t be called research

Page 3: The Web of Data: do we actually understand what we built?

Intro: philosophical stance

Page 4: The Web of Data: do we actually understand what we built?

Computer Science should be like a natural science: studying objects in the information universe, and the laws that govern them.

And yes, I believe that the information universe exists and can be studied

Page 5: The Web of Data: do we actually understand what we built?

Computer Science = Telescope Science?

“Computer science is no more about computers than astronomy is about telescopes”

-- Edsger W. Dijkstra

“The computer is not our object of study, it’s our observational instrument”

Page 6: The Web of Data: do we actually understand what we built?

Methodological Manifesto

Computer Science often: from (desired) property to (designed) object

In this talk: given a (very large & complex) object, what are its (observed) properties?

Not: “solving a problem”
But: “answering a question”

Page 7: The Web of Data: do we actually understand what we built?
Page 8: The Web of Data: do we actually understand what we built?

Our object of study & what to measure

Page 9: The Web of Data: do we actually understand what we built?

Semantic Web in 5 principles
1. Give all things a name
2. Make a graph of relations between the things
   at this point we have (only) a Giant Graph
3. Make sure all names are URIs
   at this point we have (only) a Giant Global Graph
4. Add semantics (= predictable inference)

Page 10: The Web of Data: do we actually understand what we built?

Examples of “semantics”

Semantics = predictable inference

Frank married-to Lynda
Frank married-to Hazel

• Frank is male
• married-to relates males to females
• married-to relates 1 male to 1 female
• Lynda = Hazel

lowerbound upperbound
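The inference on this slide can be sketched in a few lines of Python. This is only an illustration, assuming `married-to` is treated as a functional property (at most one partner per subject), so that two names for Frank's partner must denote the same individual. The function name `infer_same_as` and the tuple encoding of triples are hypothetical, not from the talk.

```python
from itertools import combinations

def infer_same_as(triples, functional_predicates):
    """Return the equalities (a, b) that functional predicates force."""
    objects_by_key = {}
    for s, p, o in triples:
        if p in functional_predicates:
            objects_by_key.setdefault((s, p), set()).add(o)
    equalities = set()
    for objects in objects_by_key.values():
        # Two different object names for one (subject, predicate) pair
        # must refer to the same individual.
        for a, b in combinations(sorted(objects), 2):
            equalities.add((a, b))
    return equalities

triples = [
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
]
same = infer_same_as(triples, {"married-to"})
print(same)
```

The point of "predictable inference" is that every consumer of these two triples, given the same schema, arrives at the same conclusion.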

Page 11: The Web of Data: do we actually understand what we built?

Did we get anywhere?

• Google = meaningful search

• NXP = data integration

• BBC = content re-use

• BestBuy = SEO (RDF-a)

• data.gov = data-publishing

Oracle DB, IBM DB2

Reuters, New York Times, Guardian

Sears, Kmart, OverStock, Volkswagen, Renault

GoodRelations ontology, schema.org

Yahoo, Bing

Page 12: The Web of Data: do we actually understand what we built?

1 triple

How big is the Semantic Web?

Page 13: The Web of Data: do we actually understand what we built?

~10^10 triples @ 1 triple/golfball

≈ 1 triple per web-page

Jupiter

Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS

Page 14: The Web of Data: do we actually understand what we built?

Observing at different scales

Page 15: The Web of Data: do we actually understand what we built?
Page 16: The Web of Data: do we actually understand what we built?
Page 17: The Web of Data: do we actually understand what we built?

Distances weighted by number of links

Page 18: The Web of Data: do we actually understand what we built?

What is this picture telling us?

• Single connected component
• Dense clusters with sparse interconnections
• Connectivity depends on a few nodes
• The degree distribution is highly skewed
• Its structure varies between aggregation levels

Page 19: The Web of Data: do we actually understand what we built?

What is this picture telling us?

• Does the meaning of a node depend on the cluster it appears in?

• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of low-level and high-level structure?

Logic?

Page 20: The Web of Data: do we actually understand what we built?

Measuring what?
• degree distribution: P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of neighbourhood; an intuitive notion of connectivity, but only local
• betweenness centrality: fraction of all shortest paths that pass through a node; how essential the node is for connectivity; likelihood of being visited on a graph walk
• closeness centrality: 1 / average distance to all other nodes; where to start a graph walk
• average shortest path length: helps to tune the upper bound on graph walks
• number of (strongly) connected components
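Most of these measures can be computed with plain breadth-first search. The following is a minimal sketch, assuming a small, undirected, connected graph stored as an adjacency dictionary; the function names and the toy `star` graph are illustrative, and betweenness centrality is left out because it needs more machinery.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_distribution(adj):
    """P(d(v) = n) for every degree n that occurs."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {d: c / len(adj) for d, c in counts.items()}

def degree_centrality(adj):
    """Neighbourhood size relative to the n - 1 possible neighbours."""
    return {v: len(nbrs) / (len(adj) - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """1 / average distance to all other reachable nodes."""
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total else 0.0
    return result

def average_shortest_path_length(adj):
    """Mean distance over all ordered pairs of distinct, connected nodes."""
    total = pairs = 0
    for v in adj:
        for u, d in bfs_distances(adj, v).items():
            if u != v:
                total += d
                pairs += 1
    return total / pairs

def connected_components(adj):
    """Number of connected components of an undirected graph."""
    seen, count = set(), 0
    for v in adj:
        if v not in seen:
            seen.update(bfs_distances(adj, v))
            count += 1
    return count

# Toy graph: hub "b" with three leaves.
star = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
print(degree_centrality(star)["b"], closeness_centrality(star)["a"])
```

Every function except the first two runs one BFS per node, which is fine for a toy graph but is exactly why such measurements become non-trivial at Web-of-Data scale.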

Page 21: The Web of Data: do we actually understand what we built?

Measuring when? 2009, 2014

Real phenomenon or measurement artefact?

Page 22: The Web of Data: do we actually understand what we built?

Some first measurements & their difficulties

Christophe Gueret

Page 23: The Web of Data: do we actually understand what we built?

OK, let’s measure

• Billion Triple Challenge 2009

• WoD 2009
• WoD 2010
• BTC aggregated
• BTC aggregated & intersected
• sameAs aggregate

Non-trivial decisions

Page 24: The Web of Data: do we actually understand what we built?

OK, let’s measure

Degree distribution: BTC, BTC aggregated

This suggests a power-law distribution at different scales
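A straight line on a log-log plot is only weak evidence for a power law, so it can help to estimate the exponent directly. The sketch below uses the standard continuous maximum-likelihood approximation, alpha = 1 + n / Σ ln(d_i / d_min); it is a swapped-in textbook estimator, not the method used for the plots in this talk, and the degree list is made up for illustration.

```python
import math

def powerlaw_exponent(degrees, d_min=1):
    """Continuous MLE approximation of the exponent for degrees >= d_min."""
    tail = [d for d in degrees if d >= d_min]
    log_sum = sum(math.log(d / d_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree above d_min")
    return 1.0 + len(tail) / log_sum

# Hypothetical degree sequence, heavily skewed towards low degrees.
degrees = [1, 1, 1, 1, 2, 2, 4, 8]
alpha = powerlaw_exponent(degrees)
print(round(alpha, 3))
```

Comparing the estimate across aggregation levels (BTC versus BTC aggregated) would be one way to make "at different scales" quantitative.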

Page 25: The Web of Data: do we actually understand what we built?

OK, let’s measure

• Comparing WoD 2009 & 2010: increasing power-law behaviour.
• Top 5 by degree centrality in sameAs-aggregated. Preferential attachment?

Dataset                 sameAs degree centrality
Revyu.com               0.039
Semanticweb.org         0.037
Dbpedia.org             0.027
Data.semanticweb.org    0.019
www.deri.ie             0.017

This guy owns 4 out of these 5. Interesting socio-technical questions

Page 26: The Web of Data: do we actually understand what we built?

But what should we measure?

• Treat sameAs nodes as a single node? (semantically yes, pragmatically no?)

• Is connectedness meaningful, instead of strongly connected? (semantically no, pragmatically yes?)

???????

Page 27: The Web of Data: do we actually understand what we built?

And what are “good” values?

• Degree distribution should be power-law? (robust against random decay)

• Local clustering coefficient should be high? (strongly connected “topics”)

• Betweenness impact of a sameAs-link should be high? (adds much extra information)

???????

Page 28: The Web of Data: do we actually understand what we built?

How to build a WoD observatory?

Wouter Beek

Page 29: The Web of Data: do we actually understand what we built?

LOD Laundromat: clean your dirty triples

• crawl from registries (CKAN), by chasing URLs; users can also submit URLs
• read multiple formats
• clean syntax errors
• remove duplicates
• compute meta-data information
• publish triples as JSON API & (meta-data) as SPARQL
• harvest 1B triples/day

Page 30: The Web of Data: do we actually understand what we built?

LOD Laundromat:

• 600,000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals (not counting 6,699,148,542 integers, dates, etc)
• 328Gb of zipped RDF
• 1.2% of docs use owl:sameAs, 1.5% use OWL at all

Page 31: The Web of Data: do we actually understand what we built?

From LOD Laundry to WoD observatory

• What facilities are needed?
• Centralised or distributed?
• Disagree mostly with Web Science Journal paper

???????

Page 32: The Web of Data: do we actually understand what we built?

Graph structure as a proxy for semantics

Laurens Rietveld

Page 33: The Web of Data: do we actually understand what we built?

Hotspots in Knowledge Graphs
• Observation: realistic queries only hit a small part of the data (< 2%)
• DBPedia would need 500k queries to hit < 1%
• Non-trivial to obtain these numbers

Dataset            Size    #queries   Coverage
DBPedia 3.9        459M    1640       0.003%
Linked Geo Data    289M    81         1.917%
MetaLex            204M    4933       0.016%
Open-BioMed        79M     931        3.100%
Bio2RDF/KEGG       50M     1297       2.013%
SW Dog Food        240K    193        39.438%

Page 34: The Web of Data: do we actually understand what we built?

Can graph-structure help us here?

• can we find the popular part of the graph without knowing the queries?

For a subgraph G' ⊆ G, find the relevance function F(G',G): the probability that, for a realistic query Q, a ∈ Q(G) implies a ∈ Q(G').

Can we approximate F by simple structural properties?
= Can we use structure to predict semantic importance?

Page 35: The Web of Data: do we actually understand what we built?

Experiment
• Use graph-measures as selection thresholds
  – indegree (easy)
  – outdegree (easy)
  – pagerank (doable, iterative)
  – betweenness centrality (hard)
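To make "easy" versus "doable, iterative" concrete, here is a rough sketch of two of these measures on a directed graph given as (subject, object) edges. It assumes a standard PageRank power iteration with damping 0.85 and uniform redistribution of rank from dangling nodes; the function names and the toy edges are illustrative, and nothing here is claimed to be the exact pipeline used in the experiment.

```python
from collections import Counter

def indegree(edges):
    """Number of incoming edges per node (one pass over the data)."""
    return Counter(target for _, target in edges)

def pagerank(edges, damping=0.85, iterations=50):
    """Plain power iteration; needs repeated passes over the graph."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their rank over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

edges = [
    ("Laurens", "Amsterdam"), ("Amsterdam", "NL"),
    ("Stefan", "Berlin"), ("Berlin", "Germany"),
    ("Rinke", "Heerenveen"),
]
rank = pagerank(edges)
print(sorted(rank, key=rank.get, reverse=True)[:2])
```

In-degree cannot tell `NL` from `Amsterdam` here (both have one incoming edge), while PageRank ranks `NL` higher because rank flows along the chain.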

Evaluate

Queries

Page 36: The Web of Data: do we actually understand what we built?

Evaluation: example

Dataset

Subject      Predicate     Object        Weight
:Laurens     :bornIn       :Amsterdam    0.6
:Amsterdam   :capitalOf    :NL           0.1
:Stefan      :bornIn       :Berlin       0.9
:Berlin      :capitalOf    :Germany      0.5
:Rinke       :bornIn       :Heerenveen   0.1

Triples from query

Subject      Predicate     Object
:Laurens     :bornIn       :Amsterdam
:Amsterdam   :capitalOf    :NL
:Stefan      :bornIn       :Berlin
:Berlin      :capitalOf    :Germany

Which answers would we get with a sample of 60%?

Query

SELECT ?person ?country WHERE { ?person :bornIn ?city. ?city :capitalOf ?country }
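The question on this slide can be worked through by hand or with a short script. The sketch below assumes that "a sample of 60%" means keeping the 60% of triples with the highest weight; that reading, the hard-coded join for the query, and the function names are assumptions made for illustration.

```python
def answer_query(triples):
    """?person :bornIn ?city . ?city :capitalOf ?country"""
    capital_of = {s: o for s, p, o in triples if p == ":capitalOf"}
    return {
        (person, capital_of[city])
        for person, p, city in triples
        if p == ":bornIn" and city in capital_of
    }

def top_fraction(weighted_triples, fraction):
    """Keep the given fraction of triples, highest weight first."""
    ranked = sorted(weighted_triples, key=lambda row: row[3], reverse=True)
    keep = round(len(ranked) * fraction)
    return [row[:3] for row in ranked[:keep]]

dataset = [
    (":Laurens", ":bornIn", ":Amsterdam", 0.6),
    (":Amsterdam", ":capitalOf", ":NL", 0.1),
    (":Stefan", ":bornIn", ":Berlin", 0.9),
    (":Berlin", ":capitalOf", ":Germany", 0.5),
    (":Rinke", ":bornIn", ":Heerenveen", 0.1),
]

full = answer_query([row[:3] for row in dataset])
partial = answer_query(top_fraction(dataset, 0.6))
recall = len(partial & full) / len(full)
print(partial, recall)
```

Under this reading the sample keeps three triples and still answers one of the two results, (:Stefan, :Germany), so recall is 0.5; the low weight on `:Amsterdam :capitalOf :NL` is what loses the other answer.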

Page 37: The Web of Data: do we actually understand what we built?

Structural sampling: results

Why does this work so unreasonably well?

Which methods work on which types of graphs?

Logic?

Page 38: The Web of Data: do we actually understand what we built?

Exploiting the graph structure (inconsistency)

Zhisheng Huang

Page 39: The Web of Data: do we actually understand what we built?


General Idea

[Figure: selections s(T,φ,0), s(T,φ,1), s(T,φ,2)]

Page 40: The Web of Data: do we actually understand what we built?

Which selection function s(T,φ,n)? V1: symbol-distance

• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a symbol with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a symbol with the definition of a concept in S(T,φ,n-1)
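The symbol-distance selection above is a breadth-first expansion over shared symbols. Here is a minimal sketch, assuming each concept's definition is reduced to the set of symbols it mentions (including its own name) and the query is given as a set of symbols; the toy `definitions` and the function name `select` are hypothetical. Level 0 returns no concepts from T because only the query itself is selected.

```python
def select(definitions, query_symbols, n):
    """Concepts of T within symbol-distance n of the query."""
    selected = set()
    symbols = set(query_symbols)
    for _ in range(n):
        # Select every concept whose definition shares a symbol with
        # the query or with an already selected concept.
        selected = {c for c, syms in definitions.items() if syms & symbols}
        symbols = set(query_symbols).union(*(definitions[c] for c in selected))
    return selected

# Hypothetical definitions: concept -> symbols occurring in its definition.
definitions = {
    "MadCow": {"MadCow", "Cow", "eats", "Sheep"},
    "Cow": {"Cow", "Vegetarian"},
    "Sheep": {"Sheep", "Animal"},
    "Vegetarian": {"Vegetarian", "Animal", "eats", "Plant"},
    "Plant": {"Plant"},
}

for level in range(4):
    print(level, sorted(select(definitions, {"MadCow"}, level)))
```

Note that the role name `eats` already pulls in `Vegetarian` at level 2; counting only concept names instead of all symbols would give a slower-growing selection.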

Page 41: The Web of Data: do we actually understand what we built?

Which selection function s(T,φ,n)? V2: concept-distance

• S(T,φ,0) = {φ}
• S(T,φ,1) = all concepts whose definition shares a concept with the definition of φ
• S(T,φ,n) = all concepts whose definition shares a concept with the definition of a concept in S(T,φ,n-1)

Page 42: The Web of Data: do we actually understand what we built?


Evaluation

• “Graph-growing” gave a high-quality, sound approximation

Ontology         #queries   Unexpected   Intended
MadCow+          2594       0            93%
Communication    6576       0            96%
Transportation   6258       0            99%

Why does this work so unreasonably well?

Page 43: The Web of Data: do we actually understand what we built?

Which selection function s(T,φ,n)? V3: Google distance

\[
\mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}
\]

where
f(x) is the number of Google hits for x
f(x,y) is the number of Google hits for the tuple of search items x and y
M is the number of web pages indexed by Google

≈ symmetric conditional probability of co-occurrence
≈ estimate of semantic distance
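The Normalised Google Distance is easy to compute once the hit counts are available. The sketch below takes the counts as plain numbers; how to obtain them from a search engine is left open, and the counts in the example are invented.

```python
import math

def ngd(fx, fy, fxy, m):
    """Normalised Google Distance from hit counts f(x), f(y), f(x,y) and index size M."""
    if min(fx, fy, fxy) <= 0:
        raise ValueError("hit counts must be positive")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))

# Hypothetical counts: two terms that always co-occur, and two that rarely do.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(100, 100, 10, 10**4)
print(always_together, rarely_together)
```

Terms that always co-occur get distance 0, and the distance grows as the joint hit count falls relative to the individual counts.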

Page 44: The Web of Data: do we actually understand what we built?

Google distance (NGD)

[Figure: NGD between the concepts animal, sheep, cow, madcow, vegetarian, plant]

Page 45: The Web of Data: do we actually understand what we built?

Google distance

This isn’t supposed to work!

Page 46: The Web of Data: do we actually understand what we built?

URIs are supposed to be meaningless..

Page 47: The Web of Data: do we actually understand what we built?

Information content of semantic graphs

Steven de Rooij

Page 48: The Web of Data: do we actually understand what we built?

Compressibility as an information measure

• 11111111111111111111 = 20x1
• 11111000001111100000 = 5x1;5x0;5x1;5x0
• 10110001011101010010 = ?
• random = incompressible

Depends on target language:
• 11111000001111100000 = 2x(5x1;5x0)
• 3.141592653589793238 =
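The idea can be tried out with any off-the-shelf compressor. The sketch below uses `zlib` as a stand-in "target language", which is an assumption for illustration and only gives a rough upper bound on information content; the strings are longer than on the slide because the fixed overhead of the compressor would otherwise dominate.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))

constant = b"1" * 2000                       # like 20x1, but longer
blocks = (b"1" * 5 + b"0" * 5) * 200         # like 5x1;5x0 repeated
noise = os.urandom(2000)                     # random bytes

for name, data in [("constant", constant), ("blocks", blocks), ("noise", noise)]:
    print(name, len(data), compressed_size(data))
```

The regular strings shrink to a few dozen bytes while the random bytes do not shrink at all. The digits of π would also look incompressible to `zlib`, even though a short program generates them, which is the sense in which the measure depends on the target language.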

Page 49: The Web of Data: do we actually understand what we built?

Do URLs encode meaning?

• Below horizontal: URIs encode explicit class info
• Left of vertical: URIs encode implicit class info

BTW, this is 600,000 datapoints (RDF docs)

We need a semantics that accounts for this!

Page 50: The Web of Data: do we actually understand what we built?

Decompressibility as an information measure

Nobody can predict these numbers

Page 51: The Web of Data: do we actually understand what we built?

Exploiting the graph structure (inference)

Kathrin Dentler

Page 52: The Web of Data: do we actually understand what we built?


Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time

Every author of a paper is a person

Every person is also an agent
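A toy version of such a swarm fits in a few dozen lines. This is a loose sketch under several assumptions: each rule is a Python function over a single triple, each micro-reasoner does a seeded random walk along outgoing edges, and the expiry of deduced facts is left out for brevity. All names and the example facts are hypothetical.

```python
import random

TYPE = "rdf:type"

def author_is_person(triple):
    """Every author of a paper is a person."""
    s, p, _ = triple
    return (s, TYPE, "Person") if p == "authorOf" else None

def person_is_agent(triple):
    """Every person is also an agent."""
    s, p, o = triple
    return (s, TYPE, "Agent") if p == TYPE and o == "Person" else None

def closure(facts, rules):
    """Reference result: apply all rules everywhere until nothing changes."""
    known, changed = set(facts), True
    while changed:
        changed = False
        for triple in list(known):
            for rule in rules:
                new = rule(triple)
                if new is not None and new not in known:
                    known.add(new)
                    changed = True
    return known

def swarm(facts, rules, steps, seed=0):
    """One walker per rule; each applies its rule to the node it stands on."""
    rng = random.Random(seed)
    known = set(facts)
    nodes = sorted({t[0] for t in facts} | {t[2] for t in facts})
    positions = [rng.choice(nodes) for _ in rules]
    for _ in range(steps):
        for i, rule in enumerate(rules):
            outgoing = sorted(t for t in known if t[0] == positions[i])
            for triple in outgoing:
                new = rule(triple)
                if new is not None:
                    known.add(new)
            # Follow a random outgoing edge, or jump when stuck.
            targets = [t[2] for t in outgoing if t[2] in nodes]
            positions[i] = rng.choice(targets) if targets else rng.choice(nodes)
    return known - set(facts)

facts = {("Kathrin", "authorOf", "Paper1"), ("Frank", "authorOf", "Paper1")}
rules = [author_is_person, person_is_agent]
derived = swarm(facts, rules, steps=200)
print(sorted(derived))
```

Whatever the walkers find is sound, since it is always a subset of the full closure, but which facts have been found after a given number of steps depends on where the walkers happened to go; that is the trade of determinism and completeness for anytime behaviour.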

Page 53: The Web of Data: do we actually understand what we built?


Some early results
• Most of the derivations are produced
• Lost: determinism, completeness
• Gained: anytime, coherent, prioritised

For which graphs does this work well or not?

Page 54: The Web of Data: do we actually understand what we built?

Closing: A call to all Semantic Web researchers

Page 55: The Web of Data: do we actually understand what we built?
Page 56: The Web of Data: do we actually understand what we built?

A gazillion new open questions

don’t just try to build things, also try to understand things

don’t just ask how, also ask why