Exploring Linked Data content through network analysis Christophe Guéret (@cgueret) Free University Amsterdam Co-explorers: Stefan Schlobach, Shenghui Wang, Paul Groth, Frank van Harmelen http://latc-project.eu http://www.vu.nl

Upload: christophe-gueret

Post on 16-May-2015

DESCRIPTION

Presentation given at a seminar at Yahoo.

TRANSCRIPT

Page 1: Exploring Linked Data content through network analysis

Exploring Linked Data content through network analysis

Christophe Guéret (@cgueret)

Free University Amsterdam

Co-explorers: Stefan Schlobach, Shenghui Wang, Paul Groth, Frank van Harmelen

http://latc-project.eu http://www.vu.nl

Page 2: Exploring Linked Data content through network analysis

November 23, 2011 Analysis of Linked Data 2/35

Outline of the talk

What is Linked Data?

What is there to be analysed?

Do we miss something?

New research directions and first results

Page 3: Exploring Linked Data content through network analysis


Linked Data

http://www.flickr.com/photos/erikcharlton/3337465138

Linked Data (aka Semantic Web)

Page 4: Exploring Linked Data content through network analysis


What is the problem?

Frank and Christophe publish some open data

Roi wants to combine and enrich it

Marvel icons: mermer, DeviantArt

Frank's table (on the WWW):
Kennissen Staad
Christophe Amsterdam
Peter Barcelona
David Parijs

Christophe's table (on the WWW):
Ville Pays
Barcelone Espagne
Paris France
Amsterdam Pays-Bas

Page 5: Exploring Linked Data content through network analysis


What is the problem?

Data integration issue

“Kennissen”, “Staad”, “Ville”, “Pays” ?

“Paris” = “Parijs” ?

“Amsterdam” = “Amsterdam” ?

A lot of work, and it must be redone on every update

Frank's table:
Kennissen Staad
Christophe Amsterdam
Peter Barcelona
David Parijs

+

Christophe's table:
Ville Pays
Barcelone Espagne
Paris France
Amsterdam Pays-Bas

= ?

Page 6: Exploring Linked Data content through network analysis


A solution

Do data integration at the data level

Use, and re-use, unambiguous identifiers

Use meta-level descriptions of the identifiers

Proposal: use the Web as a platform

Identifiers = URIs

Descriptions = de-referenced documents
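
A minimal sketch of "descriptions = de-referenced documents", assuming the server honours HTTP content negotiation and can return Turtle. The helper names and the choice of media type are illustrative and not taken from the talk. The request is only built here; actually sending it needs network access.

```python
from urllib.request import Request, urlopen


def description_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Build an HTTP GET asking for a machine-readable description of `uri`."""
    return Request(uri, headers={"Accept": media_type})


def fetch_description(request: Request) -> str:
    """De-reference the identifier (requires network access)."""
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# The identifier doubles as the address of its own description.
request = description_request("http://dbpedia.org/resource/Amsterdam")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling fetch_description(request) would return whatever document the server chooses to publish for that identifier.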

Page 7: Exploring Linked Data content through network analysis


Frank publishes his data

Kennissen Staad
Christophe Amsterdam
Peter Barcelona
David Parijs

Graph (written as triples):
ex:Christophe rdf:type ex:Acquaintance
ex:Peter rdf:type ex:Acquaintance
ex:David rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter ex:worksIn dbpedia:Barcelona
ex:David ex:worksIn dbpedia:Paris

Use of compact URIs
dbpedia = http://dbpedia.org/resource/
ex = http://example.org/
rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#

Each edge with its two end nodes, e.g. ex:Christophe ex:worksIn dbpedia:Amsterdam, is a “triple”
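
A small sketch of how compact URIs expand to full URIs and how a triple can be held in memory. The prefix table is the one from the slide; the function name and the tuple representation are assumptions made for illustration.

```python
# Prefix table as given on the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie: str) -> str:
    """Turn a compact URI such as 'ex:worksIn' into a full URI."""
    prefix, _, local_name = curie.partition(":")
    return PREFIXES[prefix] + local_name


# A triple is simply (subject, predicate, object).
compact_triple = ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam")
full_triple = tuple(expand(term) for term in compact_triple)

for term in full_triple:
    print(term)
```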

Page 8: Exploring Linked Data content through network analysis


Christophe re-uses part of Frank's data to publish his own

Ville Pays
Barcelone Espagne
Paris France
Amsterdam Pays-Bas

Graph (written as triples):
ex:Christophe rdf:type ex:Acquaintance
ex:Peter rdf:type ex:Acquaintance
ex:David rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter ex:worksIn dbpedia:Barcelona
ex:David ex:worksIn dbpedia:Paris
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Barcelona ex:isIn dbpedia:Spain
dbpedia:Paris ex:isIn dbpedia:France

Page 9: Exploring Linked Data content through network analysis


Roi adds some more information

The graph keeps the triples published by Frank and Christophe and gains:
dbpedia:Netherlands ex:isIn dbpedia:Europe
dbpedia:Spain ex:isIn dbpedia:Europe
dbpedia:France ex:isIn dbpedia:Europe
ex:Acquaintance rdfs:label “Conocido”@es

Page 10: Exploring Linked Data content through network analysis


dbpedia:Amsterdam

Page 11: Exploring Linked Data content through network analysis


Bonus! Reasoning with Semantics

dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe

+

ex:isIn rdf:type owl:TransitiveProperty

=

dbpedia:Amsterdam ex:isIn dbpedia:Europe

Example usage

Materialize implicit information

Check for consistency
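
A sketch of "materialize implicit information" for the transitive ex:isIn example. This is a naive fixed-point loop written for illustration, not an OWL reasoner; the function name and the tuple representation of triples are assumptions.

```python
def materialize_transitive(triples, predicate):
    """Return the triples plus every (s, predicate, o) implied by transitivity."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        links = [(s, o) for s, p, o in closed if p == predicate]
        for start, middle in links:
            for other_start, end in links:
                inferred = (start, predicate, end)
                if middle == other_start and inferred not in closed:
                    closed.add(inferred)
                    changed = True
    return closed


FACTS = {
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
}

# Only the newly derived triples.
IMPLICIT = materialize_transitive(FACTS, "ex:isIn") - FACTS
print(IMPLICIT)
```

Here the only derived triple is dbpedia:Amsterdam ex:isIn dbpedia:Europe, matching the slide.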

Page 12: Exploring Linked Data content through network analysis


Rough estimate of size

295 data sets, 31B facts in LOD Cloud

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 13: Exploring Linked Data content through network analysis


Lots of Data to analyze! :-)

http://www.flickr.com/photos/argonne/3323018571

Page 14: Exploring Linked Data content through network analysis


But analyzing what exactly?

Tables of facts published at different locations

A distributed Knowledge Base

Subject Predicate Object
ex:Christophe rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter rdf:type ex:Acquaintance
... ... ...

Subject Predicate Object
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe
... ... ...

Subject Predicate Object
ex:Acquaintance rdfs:label “Conocido”@es
... ... ...
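
To make the "distributed Knowledge Base" idea concrete: because the tables share identifiers, combining them is a plain set union, after which a question can span several sources. The sketch below is illustrative only; the variable and function names are assumptions, and the literal is written as a plain string.

```python
# One set of (subject, predicate, object) facts per publishing location.
SOURCE_1 = {
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
}
SOURCE_2 = {
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
}
SOURCE_3 = {
    ("ex:Acquaintance", "rdfs:label", '"Conocido"@es'),
}

# Integration at the data level: no column matching, just a union.
KNOWLEDGE_BASE = SOURCE_1 | SOURCE_2 | SOURCE_3


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def work_countries(triples, person):
    """Follow ex:worksIn and then ex:isIn, possibly across sources."""
    return {
        country
        for city in objects(triples, person, "ex:worksIn")
        for country in objects(triples, city, "ex:isIn")
    }


print(work_countries(KNOWLEDGE_BASE, "ex:Christophe"))
```

The same question asked of SOURCE_1 alone returns an empty set, since the city-to-country facts live elsewhere.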

Page 15: Exploring Linked Data content through network analysis


Analysis workflow

1. Gather a snapshot of triples

2. Compute descriptive statistics

Top resources (subject, predicate, object)

Frequency of cross-link types (SP, SO, PO, ...)

Connected components

Path frequencies

=> Already tricky enough: the data is really big!

=> We should be able to get more out of the data
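
Two of the descriptive statistics listed above, sketched on a toy snapshot. This assumes the whole snapshot fits in memory, which is exactly what stops being true at Linked Data scale; the function names and sample triples are illustrative.

```python
from collections import Counter, defaultdict

SNAPSHOT = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("dbpedia:Paris", "ex:isIn", "dbpedia:France"),
]


def top_resources(triples, position, n=3):
    """Most frequent terms at a position (0=subject, 1=predicate, 2=object)."""
    return Counter(triple[position] for triple in triples).most_common(n)


def connected_components(triples):
    """Components of the graph when edge direction and labels are ignored."""
    neighbours = defaultdict(set)
    for subject, _, obj in triples:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


print(top_resources(SNAPSHOT, position=1, n=1))
print(sorted(len(c) for c in connected_components(SNAPSHOT)))
```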

Page 16: Exploring Linked Data content through network analysis


Can we explain that?

Suggestions

Started the graph

General knowledge

Very well known

Page 17: Exploring Linked Data content through network analysis


or that?

Suggestions

All published by Bio2RDF

Well aware of each other

Overlapping domain

Page 18: Exploring Linked Data content through network analysis


Could we predict the impact of ...

DBpedia being down for a while?

SIOC renaming “User” to “UserAccount”?

Creating a dataset that turns out to be popular?

Analysing a set of triples is not enough

Page 19: Exploring Linked Data content through network analysis


Are we overlooking something?

Page 20: Exploring Linked Data content through network analysis


It's not only about the resources

Several entities related to the data

Data publishers/consumers, resources (e.g. ex:something), Web servers

Interactions between all of them

Page 21: Exploring Linked Data content through network analysis


There are different scales

Triples level versus Resource groups level

Different data complexity at each scale

[Figure: the example graph of acquaintances, cities, countries and dbpedia:Europe]
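
One way to move from the triples level to a resource groups level is to collapse every resource onto its namespace prefix and count the links between groups. Grouping by prefix is an assumption chosen for this sketch, as are the function names and the convention that literals are strings starting with a double quote.

```python
from collections import Counter

TRIPLES = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "ex:worksIn", "dbpedia:Barcelona"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
    ("ex:Acquaintance", "rdfs:label", '"Conocido"@es'),
]


def group_of(resource: str) -> str:
    """Group a compact URI by its prefix, e.g. 'dbpedia:Paris' -> 'dbpedia'."""
    return resource.partition(":")[0]


def group_level_network(triples):
    """Weighted edges between groups; literal objects are not resources."""
    edges = Counter()
    for subject, _, obj in triples:
        if obj.startswith('"'):
            continue
        edges[(group_of(subject), group_of(obj))] += 1
    return edges


GROUP_NETWORK = group_level_network(TRIPLES)
for (source, target), weight in sorted(GROUP_NETWORK.items()):
    print(source, "->", target, weight)
```

Five resource-to-resource triples shrink to three weighted edges, which illustrates how the data complexity differs between the two scales.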

Page 22: Exploring Linked Data content through network analysis


It is not a static network

Size and topology evolve over time

2007 2008 2010

Page 23: Exploring Linked Data content through network analysis


Linked Data is a Complex System

Multiple scales of observation

Emergence of properties

The whole is more than the sum of the parts

=> Interactions/relations are important to understand the system behavior

=> We can benefit from a large body of research results in Complex Systems study

Page 24: Exploring Linked Data content through network analysis


Initial findings and future work

Ya3hs3/2531493704 on Flickr

Page 25: Exploring Linked Data content through network analysis


New analysis workflow

1. Gather a snapshot of triples

2. Gather information about other types of interactions

3. Create specific networks related to the research questions at hand

4. Run metrics, interpret results

Page 26: Exploring Linked Data content through network analysis


The LOD is not what we think it is

LOD Cloud 2009/2010 vs BTC 2009 crawl

Crawled sample differs from the community based view

LOD Cloud has lumpy structure

Evolution of LOD Cloud

Centrality changes

Increased density and connectivity

Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011). Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community. Advances in Complex Systems 14 (04)

Page 27: Exploring Linked Data content through network analysis


Page 28: Exploring Linked Data content through network analysis


The tools we need don't exist

We need to flatten the networks to study them

Some specific aspects of the system

Existence of implicit links

Multi-relational and dynamic

Distributed

Hypergraph of relations

Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011). Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community. Advances in Complex Systems 14 (04)

Page 29: Exploring Linked Data content through network analysis


Influence content<->social networks

Generate and bind two networks

Measure evolution of degree, betweenness, clustering over time

Predict evolution

Shenghui Wang, Paul Groth (2010). Measuring the dynamic bi-directional influence between content and social networks. Proceedings of the 9th International Semantic Web Conference (ISWC2010)

[Figure: example network with nodes ex:a, ex:b and ex:c]
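
A sketch of tracking network metrics over time, limited to degree and local clustering coefficient (betweenness is left out for brevity). The yearly snapshots are invented toy data and the function names are assumptions; this is not the procedure of the cited paper.

```python
from itertools import combinations

# Hypothetical yearly snapshots of an undirected network (adjacency sets).
SNAPSHOTS = {
    2009: {"ex:a": {"ex:b", "ex:c"}, "ex:b": {"ex:a"}, "ex:c": {"ex:a"}},
    2010: {
        "ex:a": {"ex:b", "ex:c"},
        "ex:b": {"ex:a", "ex:c"},
        "ex:c": {"ex:a", "ex:b"},
    },
}


def degree(graph, node):
    """Number of neighbours of `node`."""
    return len(graph[node])


def clustering(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * linked / (k * (k - 1))


def evolution(snapshots, node):
    """(degree, clustering) of `node` for each snapshot, in time order."""
    return {
        year: (degree(graph, node), clustering(graph, node))
        for year, graph in sorted(snapshots.items())
    }


print(evolution(SNAPSHOTS, "ex:a"))
```

In this toy data the degree of ex:a stays at 2 while its clustering coefficient goes from 0.0 to 1.0 once ex:b and ex:c become linked.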

Page 30: Exploring Linked Data content through network analysis


Result for conferences

Shenghui Wang, Paul Groth (2010). Measuring the dynamic bi-directional influence between content and social networks. Proceedings of the 9th International Semantic Web Conference (ISWC2010)

Page 31: Exploring Linked Data content through network analysis


Centrality to measure robustness

Map the BTC2010 to two networks

Semantic network based on namespaces

Host network based on hostnames

Measure robustness as the variance in betweenness centrality

Find weak spots

Optimize networks to increase robustness

Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010)
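
A sketch of the hostname-based view: links between URIs are collapsed onto their hostnames, betweenness centrality is computed with Brandes' algorithm on the resulting undirected, unweighted graph, and the spread of the scores is summarised by their population variance. The example links, the function names and the choice of population variance are assumptions; the paper's exact setup may differ.

```python
from collections import defaultdict, deque
from statistics import pvariance
from urllib.parse import urlsplit


def host_network(links):
    """Undirected graph between the hostnames of linked URIs."""
    graph = defaultdict(set)
    for source, target in links:
        a, b = urlsplit(source).hostname, urlsplit(target).hostname
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Brandes' algorithm for unweighted, undirected graphs."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order, queue = [], deque([source])
        dist = {source: 0}
        paths = dict.fromkeys(graph, 0)  # number of shortest paths
        paths[source] = 1
        preds = {node: [] for node in graph}
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):  # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both of its endpoints.
    return {node: value / 2 for node, value in score.items()}


LINKS = [  # hypothetical (source URI, target URI) pairs
    ("http://a.example/x", "http://b.example/y"),
    ("http://b.example/y", "http://c.example/z"),
]
CENTRALITY = betweenness(host_network(LINKS))
SPREAD = pvariance(CENTRALITY.values())
print(CENTRALITY, SPREAD)
```

In this three-host chain every shortest path between the outer hosts passes through b.example, so it holds all the betweenness and would be the weak spot.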

Page 32: Exploring Linked Data content through network analysis


Results on hostnames

Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010)

Page 33: Exploring Linked Data content through network analysis


Results on namespaces

Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010)

Page 34: Exploring Linked Data content through network analysis


Improving the network

Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010)

Page 35: Exploring Linked Data content through network analysis


Conclusion

Take-home message

Linked Data is not a simple knowledge base

Network analysis tools give new insights on the data

Results can be used to improve the network

Future work

Do resource-centric rather than graph-centric analysis (a big bottleneck now)

Tackle the time aspect of the data

Find more analyses to perform and work out what they tell us